In recent years, moments and their uses have been investigated by mathematicians,
statisticians, and engineers. In 1987, the American Mathematical Society sponsored a short
course on "Moments in Mathematics" at its meeting in San Antonio, Texas. This led to
a volume containing the six papers delivered there. The volume was published by the
Society in its Short Course Series as Volume 37 in its Proceedings of Symposia in Applied
Mathematics.

Recently, Dr. James Maar of the National Security Agency noted a number of
problems in signal processing in which moments of distributions were important and yet
statisticians and signal processing scientists were unaware of what had been accomplished
by each other. He initiated discussions with Professor Peter Purdue of the Operations
Research Department of the Naval Postgraduate School and Professor Herbert Solomon of
the Statistics Department at Stanford University about developing a conference in which
moments and signal processing and their interaction would be featured. Professor Purdue
and Professor Solomon agreed to explore this idea and they developed and co-chaired a
Conference on Moments and Signal Processing which was held at the Naval Postgraduate
School on March 30-31, 1992. The Proceedings herein resulted from that conference.

The Conference developed around eight speakers whose interests include moments
and statistics, signal processing, and interactions between the two. Professors Jerry Mendel
and Max Nikias came from the signal processing community; Professors Satish Iyengar and
Michael Stephens came from the statistical community. The remaining four, Professors
David Brillinger, Keh-Shin Lii, Bruce Lindsay, and Ed Wegman, came at the subject in
different shadings emanating from the central core of the Conference.

The Conference was supported substantively by the National Security Agency
and partially by the Office of Naval Research. Many thanks are due to these agencies. A
number of government scientists from the Department of Defense and a limited number of
general community attendees participated in the Conference. This led to a lively audience
of 40 to 50 participants over the two-day period.

It is hoped that the wide availability of the papers in this report will lead to more
communication between the two communities and of course within each group.
ABSTRACT


This tutorial paper is focused on two topics, namely: (i) to describe systematic
methodologies for selecting nonlinear transformations for blind equalization
algorithms (and thus new types of cumulants), and (ii) to give an
overview of the existing blind equalization algorithms and point out their
strengths as well as weaknesses. It is shown in this paper that all blind
equalization algorithms belong in one of the following three categories,
depending on where the nonlinear transformation is applied to the data:
(i) the Bussgang algorithms, where the nonlinearity is in the output of the
adaptive equalization filter; (ii) the polyspectra (or Higher-Order Spectra)
algorithms, where the nonlinearity is in the input of the adaptive
equalization filter; and (iii) the algorithms where the nonlinearity is inside the
adaptive filter, i.e., the nonlinear filter or neural network. We describe
methodologies for selecting nonlinear transformations based on various
optimality criteria such as MSE or MAP. We illustrate that such existing
algorithms as Sato, Benveniste-Goursat, Godard or CMA, Stop-and-Go and
Donoho are indeed special cases of the Bussgang family of techniques when
the nonlinearity is memoryless. We present results that demonstrate that the
polyspectra-based algorithms exhibit a faster convergence rate than Bussgang
algorithms. However, this improved performance is at the expense of more
computations per iteration. We also show that blind equalizers based on
nonlinear filters or neural networks are more suited for channels that have
nonlinear distortions.

The Godard or CMA algorithm is probably the most widely used blind
equalizer in digital communications today due to its simplicity, low complexity
and constant modulus property. Its main drawbacks, however, are slow
convergence and no guarantee of global convergence starting from an arbitrary
initial guess. We present a new method for blind equalization, the CRIMNO
algorithm (i.e., criterion with memory nonlinearity), which is shown to have
the same advantages as Godard (simplicity, low complexity, constant modulus
property) while guaranteeing much faster convergence. The CRIMNO
algorithm is flexible enough to address blind deconvolution problems when
the input sequence is colored.
1 INTRODUCTION


Blind deconvolution or equalization is a signal processing procedure that recovers the input
sequence applied to a linear time-invariant nonminimum phase system from its output only.
Blind equalization algorithms are essentially adaptive filtering algorithms designed in such a way
that they do not need the external supply of a desired response to generate the error signal in
the output of the adaptive filter. In other words, the adaptive algorithm is "blind" to the desired
response. However, the algorithm itself generates the desired response by applying a nonlinear
transformation on sequences involved in the adaptation process. All blind equalization algorithms
belong to one of the following three categories, depending on where the nonlinear transformation is
applied to the data:

•  The Bussgang algorithms, where the nonlinearity is in the output of the adaptive
equalization filter;

•  The Polyspectra (or Higher-Order Spectra) algorithms, where the nonlinearity is in the
input of the adaptive equalization filter;

•  The algorithms where the nonlinearity is inside the adaptive filter; i.e., the filter is
nonlinear (e.g., Volterra) or a neural network.

The purpose of this paper is to provide an overview of the existing blind equalization
algorithms and to discuss their advantages and limitations. Conventional equalization and carrier
recovery techniques used in multilevel digital communication systems usually require an initial
training period, during which a known data sequence (i.e., training sequence) is transmitted [43],
[45]. An effective alternative approach to this problem is to utilize blind equalizers, which do not
require any known training sequence during the startup period.




The paper describes systematic methodologies for selecting the nonlinearity based on various
optimality criteria, such as maximum likelihood (ML), mean-square error (MSE) or maximum
a posteriori (MAP). As an example, it is illustrated that such existing algorithms as Sato [46],
[47], Benveniste-Goursat [5], [6], Godard or CMA [22], [50] and Stop-and-Go [41] are indeed
special cases of the family of Bussgang techniques where the nonlinearity is memoryless [3], [4]. It
is demonstrated that the polyspectra-based algorithms exhibit a faster convergence rate than the
Bussgang algorithms. However, this improved performance is at the expense of more
computational complexity. On the other hand, blind equalizers based on nonlinear filters are well suited
for channels that have nonlinear distortions [39], [40].

The Godard algorithm is probably the most widely used blind equalizer in digital
communications today due to its simplicity, low computational complexity, and constant modulus property.
Its main drawbacks, however, are slow convergence and no guarantee of global convergence
(convergence starting from an arbitrary initial guess). The paper describes the development of the
CRIMNO algorithm (i.e., criterion with memory nonlinearity), which is shown to have the same
advantages as the Godard algorithm (simplicity, low complexity, constant modulus property) while
guaranteeing much faster convergence [12], [13]. Extension of the CRIMNO algorithm to the case
of colored input signals is also presented.

The polyspectra-based adaptive blind equalization algorithms are also described in the
paper. In particular, the Tricepstrum Equalization Algorithm (TEA) [24], the Power Cepstrum
and Tricoherence Equalization Algorithm (POTEA) [7], and the Cross-Tricepstrum Equalization
Algorithm (CTEA) [8] are presented, as well as their advantages and limitations. It is shown
that these algorithms perform simultaneous identification and equalization of a nonminimum
phase communication channel from its output only. Simulations with PAM and QAM signals
demonstrate the effectiveness of the polyspectra-based algorithms.

Finally, the paper provides an overview of the neural network based adaptive equalization
algorithms either with or without a training sequence [11], [20], [20], [27], [30], [40], [40].

2  DEFINITION  OF  BLIND  EQUALIZATION  PROBLEM 

Let us consider the discrete-time linear transmission channel whose impulse response $\{f(i)\}$ is
unknown and possibly time-varying. The input data $\{x(i)\}$ are assumed to be independent and
identically distributed (i.i.d.) random variables, with non-Gaussian probability density function.
Let us also assume, without loss of generality, that the sequence $\{x(i)\}$ has mean $E\{x(i)\} = 0$
and variance $E\{|x(i)|^2\} = Q_x$. If $x(i)$ is real, we may drop the magnitude function and simply
write $E\{x^2(i)\} = Q_x$. Initially, noise is not taken into account in the output of the channel. From
Figure 2.1, it follows that the model we consider is

$$y(i) = f(i) * x(i) = \sum_k f(k)\, x(i-k) \tag{2.1}$$

where "*" denotes linear convolution and $\{y(i)\}$ is the received sequence. The problem is to
reconstruct (or restore) the input sequence $\{x(i)\}$ from the received sequence $\{y(i)\}$ or, equivalently,
to identify the inverse filter (equalizer) $\{u(i)\}$ for the channel.

From Figure 2.1, we see that the output sequence $\{\tilde{x}(i)\}$ of the equalizer is given by

$$\tilde{x}(i) = u(i) * y(i) = u(i) * (f(i) * x(i)) = u(i) * f(i) * x(i). \tag{2.2}$$


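So, to achieve

$$\tilde{x}(i) = x(i-D)\, e^{j\theta} \tag{2.3}$$

where $D$ is a constant delay and $\theta$ is a constant phase shift, it is required that

$$u(i) * f(i) = \delta(i-D)\, e^{j\theta} \tag{2.4}$$

where

$$\delta(i) = \begin{cases} 1, & i = 0 \\ 0, & \text{otherwise.} \end{cases}$$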


Performing the Fourier transform on (2.4), we obtain

$$U(\omega) \cdot F(\omega) = e^{j(\theta - \omega D)} \tag{2.5}$$

In other words, the objective of the equalizer is to achieve a transfer function

$$U(\omega) = \frac{1}{F(\omega)}\, e^{j(\theta - \omega D)} \tag{2.6}$$


In general, $D$ and $\theta$ are unknown. However, the constant delay $D$ does not affect the
reconstruction of the original input sequence $\{x(i)\}$. The constant phase shift $\theta$ can be removed by a carrier
recovery technique. As such, in the sequel, it will be assumed that $D = 0$ and $\theta = 0$.
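As a minimal sketch of (2.1) through (2.4), the following standard-library Python fragment passes a short symbol sequence through a channel and then through a truncated inverse filter. The two-tap channel, the 12-tap truncation and the symbol values are illustrative assumptions, not taken from the paper; a minimum phase channel is chosen only so that a causal inverse exists and $D = 0$, $\theta = 0$.

```python
def convolve(a, b):
    """Linear convolution of two finite sequences, as in (2.1)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for k, bk in enumerate(b):
            out[i + k] += ai * bk
    return out

# Assumed example channel (not from the paper): f(i) = {1, 0.5}.
f = [1.0, 0.5]
# Its causal inverse is u(i) = (-0.5)**i; keep the first 12 taps.
u = [(-0.5) ** i for i in range(12)]

x = [1.0, -1.0, -1.0, 1.0, 1.0, 1.0, -1.0, 1.0]  # transmitted symbols
y = convolve(f, x)                  # received sequence y(i) = f(i) * x(i)
s = convolve(u, f)                  # combined response u(i) * f(i)
x_rec = convolve(u, y)[:len(x)]     # equalizer output, here with D = 0
```

The combined response `s` is a unit impulse followed by zeros up to the truncation point, so `x_rec` reproduces `x` over the transmitted block.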

Blind equalization schemes may be classified into three categories; i.e., those which utilize
nonlinearities in the output of the adaptive equalization filter, those which place the nonlinearity
in the input of the adaptive equalization filter, and those which utilize adaptive nonlinear
equalization filters. The Bussgang equalization algorithms with memoryless or memory nonlinearity
belong to the first category, whereas the higher-order cumulant-based equalizers (TEA, POTEA,
etc.) belong to the second category, as they perform a memory nonlinear transformation on the
input data of the equalization filter. Blind equalizers based on nonlinear filters, such as the
Volterra filter or neural networks, belong to the third category. Figures 2.2 (a)-(c) illustrate the
block diagrams of the aforementioned three families of blind equalizers.

3  PERFORMANCE MEASURES FOR ALGORITHM EVALUATION

Four different performance measures are usually considered in simulation experiments for the
testing of the blind equalization algorithms: the time-average squared error ($E_{ASE}$), the
transitional symbol error rate (SER), the residual intersymbol interference (ISI) and the discrete eye
patterns [43], [44]. They are defined as follows.

Time-Average Squared Error ($E_{ASE}$ or MSE)

At iteration $(i)$, the mean square error in the output of the equalizer is defined as:

$$E_{ASE} = \frac{1}{N} \sum_{i=1}^{N} |\tilde{x}(i) - x(i-D)|^2 \tag{3.1}$$

where $\tilde{x}(i)$ is the output of the equalizer at iteration $(i)$ and $x(i-D)$ is the corresponding true
value. Note that the delay $D$, which is introduced by the channel and the equalizer, does not
affect the recovery of the original information $\{x(i)\}$. However, it must be taken into account in
the calculation of MSE$(i)$. The MSE$(i)$ gives a measure of both the noise and residual ISI at
the output of the equalizer.

Transitional Symbol Error Rate (SER)

The SER indicates the percentage of wrongly detected symbols in consecutive intervals of 500
symbols, i.e.,

$$\mathrm{SER} = \frac{\text{wrong detections in 500 symbols}}{500} \tag{3.2}$$


Residual ISI

The residual ISI in the output of the equalizer is defined as follows. Let $\{f(i)\}$ be the channel impulse
response and $\{u(i)\}$ the equalizer tap coefficients at iteration $(i)$. Let $s(i) = f(i) * u(i)$, then

$$\mathrm{ISI}(i) = \frac{\sum_{l} |s(l)|^2 - \max\{|s(l)|^2\}}{\max\{|s(l)|^2\}} \tag{3.3}$$


Physically,  this  indicates  the  amount  of  ISI  present  at  the  output  of  the  equalizer  due  to  imperfect 
equalization. 

Discrete eye patterns

Discrete eye patterns (or equalized signal constellation) consist of all possible values of the output
of the equalizer, $\tilde{x}(i)$, at iteration $(i)$, drawn in two-dimensional space. We say that the
eye pattern is open whenever the ideal decoding thresholds are easily distinguishable between
neighboring equalized states.

In our simulations, all performance measures were calculated for many independent signal
and noise realizations. For the $E_{ASE}$, time averaging over 100 samples was performed for each
realization. The eye pattern at iteration $(i)$ was obtained by drawing the output of the equalizer for all
independent realizations and for a specific number of samples (for each realization) symmetrically
located around $(i)$.
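The three numerical measures (3.1) to (3.3) can be sketched in a few lines of standard-library Python. The function names, argument conventions and the handling of the delay $D$ are assumptions made for illustration; the paper does not prescribe an implementation.

```python
def time_average_squared_error(x_tilde, x_true, delay=0):
    """E_ASE of (3.1): mean of |x_tilde(i) - x(i - D)|^2 over the overlap."""
    pairs = [(x_tilde[i], x_true[i - delay])
             for i in range(delay, len(x_tilde)) if i - delay < len(x_true)]
    return sum(abs(a - b) ** 2 for a, b in pairs) / len(pairs)

def symbol_error_rate(decisions, x_true, block=500):
    """SER of (3.2) for one block of at most `block` symbols."""
    pairs = list(zip(decisions, x_true))[:block]
    wrong = sum(1 for d, x in pairs if d != x)
    return wrong / len(pairs)

def residual_isi(s):
    """ISI of (3.3) for the combined response s(i) = f(i) * u(i)."""
    energy = [abs(v) ** 2 for v in s]
    peak = max(energy)
    return (sum(energy) - peak) / peak
```

For a perfectly equalized channel the combined response is a single impulse and `residual_isi` returns zero; any energy outside the main tap increases it.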

4  ALGORITHMS WITH NONLINEARITY IN THE OUTPUT OF THE EQUALIZATION FILTER

Let us assume that a guess for the impulse response of the inverse filter (equalizer), $u_g(i)$, has
been selected. Then,

$$u_g(i) * f(i) = \delta(i) + \epsilon(i) \tag{4.1}$$

where $\epsilon(i)$ accounts for the difference (error) between our guess $u_g(i)$ and the actual values of
$u(i)$. If we convolve the initial guess of the inverse filter, $\{u_g(i)\}$, with the received sequence,
$\{y(i)\}$, we obtain

$$\tilde{x}(i) = y(i) * u_g(i) = x(i) * f(i) * u_g(i). \tag{4.2}$$

Combining (4.2) with (4.1), we obtain

$$\tilde{x}(i) = x(i) * (\delta(i) + \epsilon(i)) = [x(i) * \delta(i)] + [x(i) * \epsilon(i)] = x(i) + n(i) \tag{4.3}$$

where

$$n(i) = x(i) * \epsilon(i) \tag{4.4}$$

is the "convolutional noise", namely, the residual ISI arising from the difference between our
guess $u_g(i)$ and the actual inverse filter $u(i)$.
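The decomposition of the equalizer output into the input plus convolutional noise can be checked numerically. The sketch below uses an assumed two-tap channel and a deliberately truncated three-tap guess for the inverse filter (both hypothetical), and verifies that the equalizer output equals the transmitted sequence plus the input convolved with the mismatch term.

```python
def convolve(a, b):
    """Linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for k, bk in enumerate(b):
            out[i + k] += ai * bk
    return out

f = [1.0, 0.5]                 # assumed channel (illustrative only)
u_g = [1.0, -0.5, 0.25]        # truncated guess at the inverse filter

s = convolve(u_g, f)           # u_g * f = delta + eps, cf. (4.1)
eps = s[:]
eps[0] -= 1.0                  # remove the unit impulse to isolate eps(i)

x = [1.0, -1.0, 1.0, 1.0, -1.0]
x_tilde = convolve(convolve(x, f), u_g)   # equalizer output, cf. (4.2)
n = convolve(x, eps)                      # convolutional noise, cf. (4.4)

# Zero-pad x so that all three sequences have the same length.
x_pad = x + [0.0] * (len(x_tilde) - len(x))
mismatch = max(abs(xt - (xp + ni)) for xt, xp, ni in zip(x_tilde, x_pad, n))
```

Here `mismatch` is zero up to rounding, which is the statement of (4.3) for this example.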

Our problem now is to utilize the deconvolved sequence $\{\tilde{x}(i)\}$ to find the "best" estimate of
$\{x(i)\}$; namely, $\{d(i)\}$. Note that in adaptive-filter literature $d(i)$ is used to represent the desired
response [25]. Two criteria are employed to determine the "best" estimate of $x(i)$ from the given
$\tilde{x}(i)$. These are the mean-square error (MSE) and maximum a posteriori (MAP).

Since the transmitted sequence $x(i)$ has a non-Gaussian probability density function, the MSE
and MAP estimates are nonlinear transformations of $\tilde{x}(i)$. In general, the "best" estimate $d(i)$ is
given by [3], [4], [23], [54]

$$d(i) = g[\tilde{x}(i)] \quad \text{(memoryless)}$$

or

$$d(i) = g[\tilde{x}(i), \tilde{x}(i-1), \ldots, \tilde{x}(i-m)] \quad (m\text{th-order memory}) \tag{4.5}$$

where $g[\cdot]$ is a nonlinear function with or without memory. The $d(i)$ is fed back into the adaptive
equalization filter as shown in Figure 4.1. From this figure, it is also apparent that the nonlinear
function $g[\cdot]$ is in the output of the equalization filter.


4.1  Optimum  Selection  of  Nonlinearities 

4.1.1  Nonlinearities  with  MSE  Estimates 

In summary, a well treated classical estimation problem is as follows:

$$\tilde{x}(i) = x(i) + n(i) \tag{4.6}$$

where

(i) $n(i)$ is Gaussian. Note that if $\epsilon(i)$ in (4.4) is long enough, the central limit theorem makes
the Gaussianity assumption for $n(i)$ reasonable.

(ii) $\{x(i)\}$ are independent, identically distributed (i.i.d.) and in general non-Gaussian. The
pdf of $x(i)$ is known; in digital communications the $\{x(i)\}$ are usually equi-probable discrete
signal points.

(iii) $x(i)$ and $n(i)$ are assumed independent.

Given the $\tilde{x}(i)$, we seek the MSE estimate of $x(i)$, namely, $d_{mse}(i)$.

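From Van Trees [52, p. 58], it follows that the best MSE estimate of $\{x(i)\}$ given $\{\tilde{x}(i)\}$ is
the mean of the a posteriori density, i.e.,

$$d_{mse}(i) = \int_{-\infty}^{+\infty} x\, P_{x/\tilde{x}}(x/\tilde{x})\, dx = E\{x(i)/\tilde{x}(i)\} \tag{4.7}$$

where $P_{x/\tilde{x}}(x/\tilde{x}) = P_{\tilde{x}/x}(\tilde{x}/x)\, P_x(x) / P_{\tilde{x}}(\tilde{x})$ is the a posteriori density; $P_{\tilde{x}/x}(\tilde{x}/x)$ is Gaussian, $N(x(i), Q_n)$,
with $Q_n$ being the variance of $\{n(i)\}$; the a priori density $P_x(x)$ is the pdf of $x(i)$; and $P_{\tilde{x}}(\tilde{x})$
behaves as a normalization constant in the integral of (4.7).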

If $x(i)$ is zero-mean Gaussian with variance $Q_x$, i.e., $P_x(x)$ is $N(0, Q_x)$, (4.7) reduces to

$$d_{mse}(i) = \frac{Q_x}{Q_x + Q_n} \cdot \tilde{x}(i) \tag{4.8}$$

which, in turn, implies that $g[\tilde{x}(i)]$ is a linear function. However, when $P_x(x)$ is non-Gaussian,
the integral (4.7) cannot be reduced to a simple expression and $g[\cdot]$ will be a nonlinear function.
In the sequel, we show $d_{mse}(i)$ versus $\tilde{x}(i)$ when the pdf $P_x(x)$ is uniform and Laplace.

Uniform  Distribution 


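The a priori pdf is given by

$$P_x(x) = \begin{cases} \dfrac{1}{2A}, & -A \le x \le A \\ 0, & \text{otherwise.} \end{cases} \tag{4.9}$$

Consequently, the a posteriori pdf takes the form

$$P_{x/\tilde{x}}(x/\tilde{x}) = \begin{cases} \dfrac{A_1(x, \tilde{x})}{B_1(\tilde{x})}, & -A \le x \le A \\ 0, & \text{otherwise} \end{cases} \tag{4.10}$$

where

$$A_1(x, \tilde{x}) = \frac{1}{2A} \frac{1}{\sqrt{2\pi Q_n}} \exp\left[-\frac{(x - \tilde{x})^2}{2 Q_n}\right], \qquad B_1(\tilde{x}) = \int_{-A}^{A} A_1(x, \tilde{x})\, dx.$$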


Substituting (4.10) into (4.7), we obtain $d_{mse}(i)$ as a function of $\tilde{x}$. However, this relationship is
not easy to express analytically and is obtained by numerical integration as shown in Figure 4.2.

Laplace Distribution


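The a priori density is given by

$$P_x(x) = \frac{\lambda}{2} \exp[-\lambda |x|] \tag{4.11}$$

and thus the a posteriori density takes the form

$$P_{x/\tilde{x}}(x/\tilde{x}) = \frac{A_2(x, \tilde{x})}{B_2(\tilde{x})} \tag{4.12}$$

where

$$A_2(x, \tilde{x}) = \frac{\lambda}{2} \exp[-\lambda |x|] \cdot \frac{1}{\sqrt{2\pi Q_n}} \exp\left[-\frac{(x - \tilde{x})^2}{2 Q_n}\right], \qquad B_2(\tilde{x}) = \int_{-\infty}^{+\infty} A_2(x, \tilde{x})\, dx.$$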


Combining (4.12) with (4.7) and using numerical integration we obtain $d_{mse}$ vs $\tilde{x}$ as shown in
Figure 4.3.
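The numerical integration mentioned for the uniform and Laplace priors can be sketched as follows. This is a simple midpoint-rule evaluation of the conditional mean (4.7), not the procedure used to produce the figures; the parameter values `A`, `LAM`, `Q_N`, the integration limits and the step count are all illustrative assumptions.

```python
import math

def gaussian_pdf(z, var):
    """Zero-mean Gaussian density with variance `var`, evaluated at z."""
    return math.exp(-z * z / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def d_mse(x_tilde, prior_pdf, q_n, lo, hi, steps=4000):
    """Conditional mean (4.7) by midpoint-rule integration over [lo, hi]."""
    h = (hi - lo) / steps
    num = den = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        w = prior_pdf(x) * gaussian_pdf(x_tilde - x, q_n)  # prior times likelihood
        num += x * w
        den += w          # accumulates the normalization constant
    return num / den

A, LAM, Q_N = 1.0, 1.0, 0.25   # illustrative values, not from the paper

def uniform_prior(x):
    return 1.0 / (2.0 * A) if -A <= x <= A else 0.0

def laplace_prior(x):
    return 0.5 * LAM * math.exp(-LAM * abs(x))

d_uni = d_mse(2.0, uniform_prior, Q_N, -A, A)
d_lap = d_mse(2.0, laplace_prior, Q_N, -12.0, 12.0)
```

Sweeping `x_tilde` over a range of values traces out the nonlinearity for each prior. With a Gaussian prior the same routine reproduces the linear gain of (4.8), which is a convenient sanity check.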


4.1.2  Nonlinearities  with  MAP  Estimates 

In this section we treat the estimation problem

$$\tilde{x}(i) = x(i) + n(i)$$

where $n(i)$ is Gaussian and $x(i)$ is i.i.d. non-Gaussian. However, we seek the MAP estimate of $x(i)$,
namely $d_{map}(i)$, when $n(i)$ is white or colored, or correlated with $x(i)$. The colored noise case,
as well as the case of noise correlated with $x(i)$, will result in a memory nonlinear relationship
between $d_{map}$ and $\tilde{x}(i)$; i.e., $d_{map}(i) = g[\tilde{x}(i), \tilde{x}(i-1), \ldots, \tilde{x}(i-m)]$. If $x(i)$ is Gaussian i.i.d.
and $n(i)$ is white Gaussian, independent of $x(i)$, then $d_{map}(i)$ is identical to $d_{mse}(i)$ and
is given by (4.8).

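If we denote $\mathbf{x} = [x(i), x(i-1), \ldots, x(1)]$ and $\tilde{\mathbf{x}} = [\tilde{x}(i), \tilde{x}(i-1), \ldots, \tilde{x}(1)]$, then the a posteriori
pdf is given by Van Trees [52, p. 58]

$$P_{\mathbf{x}/\tilde{\mathbf{x}}}(\mathbf{x}/\tilde{\mathbf{x}}) = \frac{P_{\mathbf{x}}(\mathbf{x}) \cdot P_{\tilde{\mathbf{x}}/\mathbf{x}}(\tilde{\mathbf{x}}/\mathbf{x})}{P_{\tilde{\mathbf{x}}}(\tilde{\mathbf{x}})} \tag{4.13}$$

and the MAP estimate, $\mathbf{d}_{map}$, of $\mathbf{x}$ given $\tilde{\mathbf{x}}$ is the value of $\mathbf{x}$ which maximizes $\mathcal{L}(\mathbf{x})$, where

$$\mathcal{L}(\mathbf{x}) = \ln P_{\tilde{\mathbf{x}}/\mathbf{x}}(\tilde{\mathbf{x}}/\mathbf{x}) + \ln P_{\mathbf{x}}(\mathbf{x}). \tag{4.14}$$

The denominator of (4.13) does not contribute to the maximization of $\mathcal{L}(\mathbf{x})$.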

CASE  I:  White  Gaussian  Noise 

In this case the $n(i)$ is white, Gaussian $N(0, Q_n)$, and independent of $x(i)$. It is also assumed
that $\{x(i)\}$ are i.i.d. and non-Gaussian. Consequently, joint pdfs are expressed as products of
marginal pdfs and the MAP estimate at each iteration $(i)$, $d_{map}(i)$, is obtained by maximizing

$$\mathcal{L}(x(i)) = \ln P_{\tilde{x}/x}(\tilde{x}/x) + \ln P_x(x).$$

That is to say that the estimation problem is decoupled and the resulting relationship,
$d_{map}(i)$ vs $\tilde{x}(i)$, is memoryless.

The  following  memoryless  nonlinearities  can  be  derived. 
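(i) Uniform Distribution (4.9)

$$d_{map}(i) = \begin{cases} -A, & \tilde{x}(i) < -A \\ \tilde{x}(i), & -A \le \tilde{x}(i) \le A \\ A, & \tilde{x}(i) > A \end{cases} \tag{4.15}$$

Note that $d_{map}$ does not depend on $Q_n$.

(ii) Laplace Distribution (4.11)

$$d_{map}(i) = \begin{cases} \tilde{x}(i) + \lambda Q_n, & \tilde{x}(i) < -\lambda Q_n \\ 0, & -\lambda Q_n \le \tilde{x}(i) \le \lambda Q_n \\ \tilde{x}(i) - \lambda Q_n, & \tilde{x}(i) > \lambda Q_n. \end{cases} \tag{4.16}$$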


Here the MAP estimate depends on $Q_n$. For the symmetric uniform and Laplace a priori
distributions the resulting a posteriori pdf, $P_{x/\tilde{x}}(x/\tilde{x})$, is asymmetric.

Figures 4.4 and 4.5 illustrate the MAP memoryless nonlinearities.
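The two memoryless MAP nonlinearities for white Gaussian noise are simple enough to write directly. The sketch below is one possible rendering: a hard clipper for the uniform prior and a dead-zone (soft-threshold) function for the Laplace prior. Function and parameter names are chosen for illustration.

```python
def d_map_uniform(x_tilde, a):
    """Uniform prior on [-a, a]: clip the equalizer output to the support."""
    return max(-a, min(a, x_tilde))

def d_map_laplace(x_tilde, lam, q_n):
    """Laplace prior: shrink toward zero by lam * q_n, with a dead zone."""
    t = lam * q_n
    if x_tilde > t:
        return x_tilde - t
    if x_tilde < -t:
        return x_tilde + t
    return 0.0
```

For example, with `lam = 1.0` and `q_n = 0.25`, an output of 2.0 is shrunk to 1.75, while any output with magnitude below 0.25 maps to zero. The clipper, by contrast, does not involve the noise variance at all.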

CASE  II:  Colored  Gaussian  Noise 


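In this case we assume that $n(i)$ is colored Gaussian $N(0, R)$, where $R$ is the $m \times m$ correlation
matrix. On the other hand, the $\{x(i)\}$ are still assumed to be i.i.d. and non-Gaussian. Based on these
assumptions, the numerator of (4.13) is

$$P_{\mathbf{x}}(\mathbf{x}) \cdot P_{\tilde{\mathbf{x}}/\mathbf{x}}(\tilde{\mathbf{x}}/\mathbf{x}) = \prod_{i=1}^{m} P_x(x(i)) \cdot P_{\tilde{\mathbf{x}}/\mathbf{x}}(\tilde{\mathbf{x}}/\mathbf{x}) \tag{4.17}$$

where $P_{\tilde{\mathbf{x}}/\mathbf{x}}(\tilde{\mathbf{x}}/\mathbf{x})$ is Gaussian, $N(\mathbf{x}, R)$.

For mathematical tractability, we consider the case $m = 2$ and derive the memory nonlinear
relationships $d_{map}(i)$ vs $\tilde{\mathbf{x}}$.

For $m = 2$ the correlation matrix takes the form

$$R = Q_n \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}, \qquad |\rho| < 1. \tag{4.18}$$

For simplicity, we also define the following vectors

$$\tilde{\mathbf{x}} = \begin{pmatrix} \tilde{x}_1 \\ \tilde{x}_2 \end{pmatrix}, \qquad \mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}. \tag{4.19}$$

(i) Uniform Distribution (4.9)

Maximizing (4.17) is equivalent here to minimizing

$$J = (\tilde{\mathbf{x}} - \mathbf{x})^T R^{-1} (\tilde{\mathbf{x}} - \mathbf{x}) \tag{4.20}$$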

with the restrictions $-A \le x_1 \le A$, $-A \le x_2 \le A$. Hence, we seek a point in the area
$X_2 = \{(x_1, x_2) : -A \le x_1 \le A,\ -A \le x_2 \le A\}$ such that $J$ is minimized. Differentiating $J$
with respect to $x_1$ and $x_2$ and setting the derivative to zero we obtain

$$\begin{aligned} (x_1 - \tilde{x}_1) - \rho (x_2 - \tilde{x}_2) &= 0 \\ (x_2 - \tilde{x}_2) - \rho (x_1 - \tilde{x}_1) &= 0. \end{aligned} \tag{4.21}$$

From (4.21), it is apparent that, if $\tilde{\mathbf{x}} \in X_2$, that is $-A \le \tilde{x}_1 \le A$ and $-A \le \tilde{x}_2 \le A$, then

$$d_{1map} = \tilde{x}_1, \qquad d_{2map} = \tilde{x}_2; \tag{4.22}$$

when $\tilde{\mathbf{x}}$ is outside $X_2$, the minimum is achieved on the boundary of $X_2$. That is

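$$\begin{aligned} d_{1map} &= k \cdot A \cdot \mathrm{sgn}[\tilde{x}_1] + (1 - k) \cdot h[\tilde{x}_1 - \rho(\tilde{x}_2 - A\,\mathrm{sgn}[\tilde{x}_2])] \\ d_{2map} &= (1 - k) \cdot A \cdot \mathrm{sgn}[\tilde{x}_2] + k \cdot h[\tilde{x}_2 - \rho(\tilde{x}_1 - A\,\mathrm{sgn}[\tilde{x}_1])] \end{aligned} \qquad \text{for } \tilde{\mathbf{x}} \notin X_2 \tag{4.23}$$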


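(ii) Laplace Distribution (4.11)

Maximizing (4.17) is equivalent here to minimizing

$$J = \lambda |x_1| + \lambda |x_2| + \tfrac{1}{2} [(\tilde{\mathbf{x}} - \mathbf{x})^T R^{-1} (\tilde{\mathbf{x}} - \mathbf{x})]. \tag{4.25}$$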
The necessary conditions are

$$\begin{aligned} \lambda\,\mathrm{sgn}[x_1] + c(x_1 - \tilde{x}_1) - c\rho(x_2 - \tilde{x}_2) &= 0 \\ \lambda\,\mathrm{sgn}[x_2] + c(x_2 - \tilde{x}_2) - c\rho(x_1 - \tilde{x}_1) &= 0 \end{aligned} \tag{4.26}$$

where $c = 1/[Q_n(1 - \rho^2)]$. Clearly, (4.26) is a nonlinear system of equations. Two special cases
are the following: 1) when $-\lambda/c < \tilde{x}_1 - \rho \tilde{x}_2,\ \tilde{x}_2 - \rho \tilde{x}_1 < \lambda/c$, then $d_{1map} = d_{2map} = 0$, and 2)
when $\rho = 0$, the problem reduces to the case of white Gaussian noise.
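Because (4.26) has no closed-form solution in general, a brute-force check can help build intuition. The sketch below evaluates the Laplace-prior cost for $m = 2$ on a coarse grid and returns the minimizing point; it is a crude illustration rather than a practical solver, and the grid span, step size and numerical values are assumptions.

```python
def j_laplace(x1, x2, xt1, xt2, lam, q_n, rho):
    """Laplace-prior cost for m = 2 with R = q_n * [[1, rho], [rho, 1]]."""
    c = 1.0 / (q_n * (1.0 - rho * rho))
    e1, e2 = xt1 - x1, xt2 - x2
    quad = c * (e1 * e1 - 2.0 * rho * e1 * e2 + e2 * e2)  # (xt-x)^T R^-1 (xt-x)
    return lam * (abs(x1) + abs(x2)) + 0.5 * quad

def map_grid_search(xt1, xt2, lam, q_n, rho, span=3.0, step=0.05):
    """Return the grid point (x1, x2) that minimizes the cost."""
    n = int(round(span / step))
    grid = [k * step for k in range(-n, n + 1)]
    best = min((j_laplace(a, b, xt1, xt2, lam, q_n, rho), a, b)
               for a in grid for b in grid)
    return best[1], best[2]

# Special case 1: small observations fall in the dead zone.
zero_case = map_grid_search(0.1, 0.1, lam=1.0, q_n=0.25, rho=0.5)
# Special case 2: rho = 0 decouples into two white-noise problems.
white_case = map_grid_search(2.0, 0.1, lam=1.0, q_n=0.25, rho=0.0)
```

With these values `zero_case` is the origin, and `white_case` matches the memoryless soft-threshold result applied to each component separately.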

4.2  The  Bussgang  Algorithms 

Fig. 4.1 illustrates the Bussgang adaptive blind equalization algorithms when an LMS-type or
stochastic gradient algorithm [53] is used for the adaptation of the equalizer coefficients, and the
nonlinearity $g^{(i)}[\cdot]$ is memoryless [3], [4], [23]. The following equations, consistent with the block
diagram of Fig. 4.1, describe the Bussgang family of algorithms:

$$\begin{array}{ll}
\mathbf{u}(i) = [u_1(i), \ldots, u_N(i)]^T & \text{equalizer taps} \\
\mathbf{u}(0) = [0, \ldots, 1, \ldots, 0]^T & \text{initial tap values} \\
\mathbf{y}(i) = [y(i), \ldots, y(i-N+1)]^T & \text{input to the equalizer block of data} \\
i = 0, 1, 2, \ldots & \text{iteration index} \\
\tilde{x}(i) = \mathbf{u}^H(i)\,\mathbf{y}(i) & \text{equalizer output or reconstructed sequence} \\
d(i) = g^{(i)}[\tilde{x}(i)] = g^{(i)}[\mathbf{u}^H(i)\,\mathbf{y}(i)] & \text{output of nonlinearity} \\
e(i) = d(i) - \tilde{x}(i) & \text{error sequence} \\
\mathbf{u}(i+1) = \mathbf{u}(i) + \mu\,\mathbf{y}(i)\,e^*(i) & \text{LMS-type adaptation}
\end{array} \tag{4.27}$$
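The recursion (4.27) translates almost line by line into code. The following sketch runs it on real-valued binary PAM data with a sign nonlinearity of the Sato type; the channel, the step size, the number of taps and the random seed are illustrative assumptions, and the residual ISI of (3.3) is used only to confirm that the taps moved in the right direction.

```python
import random

def sato(x_tilde, gamma=1.0):
    """Memoryless nonlinearity g(x) = gamma * sgn(x) for real PAM."""
    return gamma if x_tilde >= 0.0 else -gamma

def bussgang_lms(y, n_taps, mu, g):
    """Run the recursion (4.27) on real data and return the final taps."""
    u = [0.0] * n_taps
    u[n_taps // 2] = 1.0                          # u(0) = [0,...,1,...,0]^T
    for i in range(n_taps - 1, len(y)):
        block = [y[i - k] for k in range(n_taps)]   # [y(i),...,y(i-N+1)]
        x_tilde = sum(uk * yk for uk, yk in zip(u, block))  # u^H(i) y(i)
        e = g(x_tilde) - x_tilde                  # e(i) = d(i) - x_tilde(i)
        u = [uk + mu * yk * e for uk, yk in zip(u, block)]  # real data: e* = e
    return u

def convolve(a, b):
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for k, bk in enumerate(b):
            out[i + k] += ai * bk
    return out

def residual_isi(s):
    energy = [v * v for v in s]
    peak = max(energy)
    return (sum(energy) - peak) / peak

rng = random.Random(0)                            # fixed seed for repeatability
x = [rng.choice((-1.0, 1.0)) for _ in range(3000)]
f = [1.0, 0.4]                                    # assumed channel, not from the paper
y = convolve(f, x)[:len(x)]

u_final = bussgang_lms(y, n_taps=7, mu=0.01, g=sato)
isi_start = residual_isi(f)       # center-spike start leaves the channel as is
isi_end = residual_isi(convolve(u_final, f))
```

No training sequence appears anywhere in the loop: the only reference signal is `g(x_tilde)`. Swapping `sato` for another memoryless function gives another member of the family.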
4.2.1  Convergence Rate and Properties

From (4.27) and Figure 4.1, it is apparent that the output sequence of the nonlinear function,
$d(i)$, "plays the role" of the desired response or the training sequence. It is also
apparent that the Bussgang technique is simple to implement and understand, and it may be
viewed as a minor modification of the original LMS algorithm (the desired response of the original
LMS adaptation is a memoryless transformation of the transversal filter output). As such, it is
expected that the technique will have convergence that will depend on the eigenvalue spread of
the autocorrelation matrix of the received data $\{y(i)\}$.

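From (4.27), the LMS adaptation equation for the equalizer coefficients is given by

$$\mathbf{u}(i+1) = \mathbf{u}(i) + \mu\,\mathbf{y}(i)\,e^*(i) \tag{4.28}$$

If we obtain the expected value (ensemble averaging) of (4.28), we have

$$\begin{aligned} E\{\mathbf{u}(i+1)\} &= E\{\mathbf{u}(i)\} + \mu E\{\mathbf{y}(i)\,(g^{(i)}[\tilde{x}(i)] - \tilde{x}(i))^*\} \\ &= E\{\mathbf{u}(i)\} + \mu E\{\mathbf{y}(i)\,g^{(i)*}[\tilde{x}(i)]\} - \mu E\{\mathbf{y}(i)\,\tilde{x}^*(i)\}. \end{aligned} \tag{4.29}$$

The adaptive algorithm converges in the mean when

$$E\{\mathbf{y}(i)\,g^{(i)*}[\tilde{x}(i)]\} = E\{\mathbf{y}(i)\,\tilde{x}^*(i)\} \quad \text{(equilibrium)}$$

and it converges in the mean-square when

$$E\{\mathbf{u}^H(i)\,\mathbf{y}(i)\,g^{(i)*}[\tilde{x}(i)]\} = E\{\mathbf{u}^H(i)\,\mathbf{y}(i)\,\tilde{x}^*(i)\}$$

$$E\{\tilde{x}(i)\,g^{(i)*}[\tilde{x}(i)]\} = E\{\tilde{x}(i)\,\tilde{x}^*(i)\}. \tag{4.30}$$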


Thus, it is required that the equalizer output $\tilde{x}(i)$ be Bussgang at equilibrium.
Note that identity (4.30) states that the autocorrelation of $\tilde{x}(i)$ (right-hand side) equals the
cross correlation between $\tilde{x}(i)$ and a nonlinear transformation of $\tilde{x}(i)$ (left-hand side). Processes
which satisfy property (4.30) are said to be Bussgang [10]. In summary, the adaptive Bussgang
techniques converge when the equalizer output sequence, $\{\tilde{x}(i)\}$, becomes Bussgang (necessary
condition).

A stochastic gradient algorithm (steepest descent) essentially minimizes iteratively a
performance index $J(i) = E\{G[\tilde{x}(i)]\}$ with respect to the equalizer coefficients $\mathbf{u}(i)$. A more general
form of the equalizer taps adaptation equation (4.28) is [25]

$$\mathbf{u}(i+1) = \mathbf{u}(i) - \mu \nabla_u J(i) \tag{4.31}$$

where $\nabla_u J(i)$ is the gradient of $J(i)$. Differentiating $J(i)$ by using the composite function rule,
we obtain

$$\nabla_u J(i) = -E\{\nabla_u[\tilde{x}(i)] \cdot \nabla_{\tilde{x}}[G(\tilde{x}(i))]\} = -E\{\mathbf{y}(i) \cdot \nabla_{\tilde{x}}[G(\tilde{x}(i))]\} \tag{4.32}$$

By dropping the expectation operation, i.e., by using a single-point unbiased estimate,
we obtain

$$\nabla_u J(i) = -\mathbf{y}(i)\,e^*(i) \tag{4.33}$$

where

$$e^*(i) = \nabla_{\tilde{x}}[G(\tilde{x}(i))] = g^{(i)*}[\tilde{x}(i)] - \tilde{x}^*(i) \tag{4.34}$$

Equation (4.34) shows the relationship between the nonlinear function $g^{(i)}[\cdot]$ used in the Bussgang
techniques and the nonlinear cost function $G[\cdot]$ which defines the performance index, $J[\cdot]$.

Example for one-dimensional modulation (PAM)

The first blind equalization algorithm was introduced by Sato in 1975 [47] for PAM signals. He
chose the simple nonlinear function

$$g(\tilde{x}) = \gamma\,\mathrm{sgn}[\tilde{x}] \tag{4.35}$$

where $\gamma$ is a gain parameter which must be chosen to satisfy the Bussgang property (4.30), i.e.,

$$E\{\tilde{x}(i) \cdot \gamma\,\mathrm{sgn}[\tilde{x}(i)]\} = E\{|\tilde{x}(i)|^2\}$$

or

$$\gamma = E\{|\tilde{x}(i)|^2\} / E\{|\tilde{x}(i)|\}. \tag{4.36}$$
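As a small worked example of the gain selection, the sketch below evaluates the ratio of second moment to mean absolute value for an equiprobable PAM alphabet and confirms the Bussgang identity for the sign nonlinearity. It assumes perfect equalization, so that the equalizer output takes exactly the transmitted symbol values; the 4-level alphabet is an illustrative choice.

```python
def sato_gamma(symbols):
    """gamma = E{|x|^2} / E{|x|} for equiprobable symbols."""
    return sum(abs(s) ** 2 for s in symbols) / sum(abs(s) for s in symbols)

def sgn(v):
    return 1.0 if v >= 0.0 else -1.0

pam4 = [-3.0, -1.0, 1.0, 3.0]          # assumed 4-level PAM alphabet
gamma = sato_gamma(pam4)

# Both sides of the Bussgang identity, averaged over the alphabet.
lhs = sum(s * gamma * sgn(s) for s in pam4) / len(pam4)   # E{x * g(x)}
rhs = sum(s * s for s in pam4) / len(pam4)                # E{x^2}
```

For binary symbols of unit magnitude the same function returns a gain of one, in which case the nonlinearity reduces to a plain sign decision.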
We could also write Sato's algorithm in terms of

1  o 

G(x)  (4.37) 

4.2.2  Extension  to  QAM  modulation 

The extension of Bussgang algorithms to two-dimensional constellations (QAM) is somewhat
straightforward [3], [4]. In the case of two independent quadrature carriers, the conditional
mean estimate of an equivalent complex transmitted symbol $x$ given the complex observation
$\tilde{x} = \tilde{x}_R + j \tilde{x}_I$ can be written as

$$d = E\{x/\tilde{x}\} = g[\tilde{x}_R] + j\,g[\tilde{x}_I]. \tag{4.38}$$

We keep the notation simple by omitting $(i)$. For example, the Sato nonlinearity for QAM signals
takes the form [47]

$$g(\tilde{x}) = \gamma\,\mathrm{csgn}(\tilde{x}) = \gamma \{\mathrm{sgn}[\tilde{x}_R] + j\,\mathrm{sgn}[\tilde{x}_I]\}. \tag{4.39}$$


It is clear that real and imaginary parts of the data can be estimated separately. The complex
data equivalent of the adaptive Bussgang techniques is described in (4.27), but with

$$g^{(i)}[\tilde{x}(i)] = g^{(i)}[\tilde{x}_R(i)] + j\,g^{(i)}[\tilde{x}_I(i)]. \tag{4.40}$$

Consequently, the error sequence is

$$e(i) = \{g^{(i)}[\tilde{x}_R(i)] - \tilde{x}_R(i)\} + j\,\{g^{(i)}[\tilde{x}_I(i)] - \tilde{x}_I(i)\}. \tag{4.41}$$
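For complex data the separable structure means the nonlinearity is applied to the real and imaginary parts independently. A minimal sketch with the complex sign function follows; the helper names and the test value are assumptions for illustration.

```python
def sgn(v):
    return 1.0 if v >= 0.0 else -1.0

def csgn(z):
    """Complex sign: sgn of the real part plus j times sgn of the imaginary part."""
    return complex(sgn(z.real), sgn(z.imag))

def sato_qam_error(x_tilde, gamma):
    """Error e = g(x_tilde) - x_tilde with g = gamma * csgn."""
    return gamma * csgn(x_tilde) - x_tilde

z = complex(0.8, -1.3)                 # hypothetical equalizer output
e = sato_qam_error(z, gamma=1.0)
# The same error, built component by component.
e_parts = complex(sgn(z.real) - z.real, sgn(z.imag) - z.imag)
```

Both constructions give the same complex error, which is why the real and imaginary rails can be processed separately.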
For example, the "Stop-and-Go" algorithm introduced by Picchi and Prati [41] is an adaptive
Bussgang technique with the following nonlinearity


$[*(*)]  =  *(*)+  ^Ax{i) 

-  \Bx-{i)  (4.42) 

where $\hat{x}(i)$ is defined as the quantizer (slicer) output in Figure 4.1 and $(A, B)$ is a pair of integers
taking values (2,0) or (1,1) or (1,-1) or (0,0). The values of $(A, B)$ are generally different at
each iteration, and how they are chosen is described later in this section.

Another example of a Bussgang technique is the heuristic modification of the Sato algorithm
suggested by Benveniste and Goursat [5], [6]. In this case, the nonlinear function takes the
form

$$ g[\tilde{x}(i)] = \tilde{x}(i) + k_1\hat{x}(i) - k_1\tilde{x}(i) + k_2|\hat{x}(i) - \tilde{x}(i)| \cdot [\gamma\,\mathrm{csgn}[\tilde{x}(i)] - \tilde{x}(i)] $$

or

$$ g[\tilde{x}(i)] = \tilde{x}(i) + \{k_1[\hat{x}(i) - \tilde{x}(i)] + k_2|\hat{x}(i) - \tilde{x}(i)| \cdot [\gamma\,\mathrm{csgn}[\tilde{x}(i)] - \tilde{x}(i)]\} \qquad (4.43) $$

where $k_1$, $k_2$ are constants. From (4.38) we observe that the Benveniste-Goursat error function
may be seen as a weighted sum of the Decision Directed (DD) [33] and Sato errors. On the
other hand, the "Stop-and-Go" error function (4.37) is the weighted sum of the DD error and
its conjugate. The weights of the two algorithms, however, are chosen in a completely different
manner.

4.2.3  Unknown  Carrier  Phase:  The  Constant  Modulus  Property 

Equation (4.33) can be written in polar coordinates as

d =     (4.44)


If we assume that all rotated constellations are equally likely, since the carrier phase is
unknown, then the conditional mean $d$ in (4.39) has the same argument as $\tilde{x}$, and is given by

$$ d = g[|\tilde{x}|] \cdot e^{j\arg(\tilde{x})} \qquad (4.45) $$

where $g[\cdot]$ is a nonlinear function, $|\tilde{x}| = \sqrt{\tilde{x}_R^2 + \tilde{x}_I^2}$ and
$\arg(\tilde{x}) = \arctan[\tilde{x}_I / \tilde{x}_R]$. Combining (4.39) with (4.40) we obtain [3], [4], [23]

$$ e(i) = d(i) - \tilde{x}(i) = g[|\tilde{x}(i)|]\,e^{j\arg[\tilde{x}(i)]} - \tilde{x}(i) = \frac{\tilde{x}(i)\,g[|\tilde{x}(i)|]}{|\tilde{x}(i)|} - \tilde{x}(i). \qquad (4.46) $$

Hence, the error term is independent of any fixed phase rotation of the signal constellation.
Equation (4.27) also represents the Bussgang technique for the case of unknown carrier phase,
provided we substitute $e(i)$ in (4.27) by $e(i)$ of (4.41).

Example:  The  Godard  (or  CMA)  Algorithm  [22],  [50] 

Under the assumption that all rotated constellations are equally likely, Godard [22] suggested
that $g[|\tilde{x}|]$ in (4.41) be chosen as

$$ g[|\tilde{x}|] = |\tilde{x}| + R_p|\tilde{x}|^{p-1} - |\tilde{x}|^{2p-1} \qquad (4.47) $$

where $R_p$ is a real constant. As we shall see, this form has some very nice properties. Special
cases of (4.42) include

$$ g[|\tilde{x}|] = (1 + R_2)|\tilde{x}| - |\tilde{x}|^3 \qquad (p = 2) $$

and

$$ g[|\tilde{x}|] = R_1 \qquad (p = 1). $$


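The parameter $R_p$ is a gain constant which has to be chosen according to (4.30). Since

$$ g[\tilde{x}(i)] = \frac{\tilde{x}(i)\,g[|\tilde{x}(i)|]}{|\tilde{x}(i)|} \qquad (4.48) $$

combining (4.43) with (4.30), we obtain

$$ E\{|\tilde{x}(i)|^2 + R_p|\tilde{x}(i)|^p - |\tilde{x}(i)|^{2p}\} = E\{|\tilde{x}(i)|^2\} $$

or

$$ R_p = \frac{E\{|\tilde{x}(i)|^{2p}\}}{E\{|\tilde{x}(i)|^p\}}. \qquad (4.49) $$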




At perfect equalization, $\tilde{x}(i) = x(i)e^{j\theta}$ (assuming time delay $D = 0$), and thus

$$ R_p = \frac{m_{2p}}{m_p}, \quad \text{where } m_p = E\{|x(i)|^p\}. $$

Combining (4.34) and (4.43), we obtain the Godard performance index nonlinearity, namely,

$$ G(\tilde{x}(i)) = \frac{1}{2p}\,(|\tilde{x}(i)|^p - R_p)^2. \qquad (4.50) $$

Fig.  4.6  summarizes  the  nonlinear  functions  of  the  Bussgang  iterative  techniques. 
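The Godard quantities above can be checked numerically. The sketch below is a minimal Python illustration, assuming an equiprobable 16-QAM alphabet with levels {-3, -1, 1, 3} on each axis (our choice, not taken from [22]) and perfect equalization, so that moments of the equalizer output equal those of the alphabet. It evaluates $R_p$ from (4.49), the nonlinearity (4.47), and the error (4.46), and verifies that the Bussgang condition used to derive (4.49) holds and that the error only rotates with the input under a fixed phase rotation.

```python
import cmath


def dispersion_constant(symbols, p):
    """R_p = E{|x|^(2p)} / E{|x|^p} for equiprobable symbols, cf. (4.49)."""
    n = len(symbols)
    num = sum(abs(s) ** (2 * p) for s in symbols) / n
    den = sum(abs(s) ** p for s in symbols) / n
    return num / den


def godard_g(mag, p, r_p):
    """Godard nonlinearity on the modulus, cf. (4.47)."""
    return mag + r_p * mag ** (p - 1) - mag ** (2 * p - 1)


def godard_error(x_tilde, p, r_p):
    """Error (4.46) with the Godard nonlinearity; x_tilde must be nonzero."""
    mag = abs(x_tilde)
    return x_tilde * godard_g(mag, p, r_p) / mag - x_tilde


def bussgang_gap(symbols, p):
    """E{|x| g(|x|)} - E{|x|^2}; zero when R_p satisfies the Bussgang property."""
    r_p = dispersion_constant(symbols, p)
    n = len(symbols)
    lhs = sum(abs(s) * godard_g(abs(s), p, r_p) for s in symbols) / n
    rhs = sum(abs(s) ** 2 for s in symbols) / n
    return lhs - rhs


# Hypothetical 16-QAM alphabet, assumed equiprobable.
LEVELS = (-3, -1, 1, 3)
QAM16 = [complex(a, b) for a in LEVELS for b in LEVELS]
R2 = dispersion_constant(QAM16, 2)

# A fixed phase rotation of the equalizer output rotates the error by the same angle.
ROT = cmath.exp(1j * 0.7)
X_TEST = complex(2.0, 1.0)
E_PLAIN = godard_error(X_TEST, 2, R2)
E_ROTATED = godard_error(X_TEST * ROT, 2, R2)
```

For this alphabet $E\{|x|^2\} = 10$ and $E\{|x|^4\} = 132$, so $R_2 = 13.2$.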

4.2.4 The Sato and Benveniste-Goursat Algorithms

Sato [46] introduced the first blind equalization scheme in 1975 by introducing the sign
nonlinearity to generate the desired response of the adaptive scheme shown in Figure 4.1, i.e.,
$d(i) = \gamma\,\mathrm{sgn}[\tilde{x}(i)]$. In 1986, Sato [47] extended his 1-D PAM algorithm to the multidimensional
blind equalization problem where all transmitted signals become vector processes and all impulse
responses (channel and equalizer) are square matrices. The extension, however, is straightforward.
For example, in the two-dimensional case of QAM signals the "sign" nonlinearity becomes
the "complex sign" defined by (4.34). The error signal of the Sato algorithm

$$ e_S(i) = \gamma\,\mathrm{csgn}[\tilde{x}(i)] - \tilde{x}(i) \qquad (4.51) $$

is very noisy around the solution unless the transmitted sequence $x(i)$ takes only the values ±1.
In other words, although $e_S(i)$ is zero-mean at the solution, it has a large variance. On the other
hand, the Decision Directed (DD) error signal $e_D(i) = \hat{x}(i) - \tilde{x}(i)$ (see Figure 4.6) [33], though not
robust for blind equalizers, enjoys the property of being identically zero at the solution. Hence,
Benveniste and Goursat [5] suggested the idea of combining (heuristically) both error signals in the
form of a weighted average as follows:

$$ e_{BG}(i) = k_1 e_D(i) + k_2 e_S(i)\,|e_D(i)| \qquad (4.52) $$

where $k_1$, $k_2$ are constants. The rationale behind the error expression (4.47) is the following.
Before the eye of the equalizer opens, $|e_D(i)|$ is large and thus the Sato error $e_S(i)$ contributes to
the proper direction. At the opening of the eye and thereafter, $|e_D(i)|$ becomes small and the DD
mode of the error $e_{BG}(i)$ takes over to speed up convergence and to achieve a faster rate than the
original Sato algorithm with $e_S(i)$. It is no wonder, therefore, that in our simulation experience
we have seen the Benveniste-Goursat (BG) algorithm exhibiting initially very slow convergence.

A  faster  convergence  rate  has  been  observed  only  after  the  eye  opens.  The  Benveniste-Goursat 
algorithm  may  be  seen  as  the  Sato  algorithm  that  switches  automatically  to  a  DD  one  when  the 
eye  of  the  equalizer  opens.  The  extension  of  the  Benveniste-Goursat  algorithm  to  a  Decision 
Feedback  Equalization  (DFE)  implementation  [2]  was  given  by  Macchi  et  al.  [32]. 
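The way the BG error (4.52) blends the two modes can be sketched in a few lines of Python. The slicer below simply picks the nearest level on each axis; the level set, the gain value and the constants $k_1$, $k_2$ passed in are illustrative assumptions rather than values from [5].

```python
def sgn(v):
    """Sign function: -1, 0 or +1."""
    return (v > 0) - (v < 0)


def csgn(z):
    """Complex sign: sgn of the real part plus j times sgn of the imaginary part."""
    return complex(sgn(z.real), sgn(z.imag))


def slicer(z, levels):
    """Nearest-level decision on each axis (a simple stand-in for the quantizer)."""
    def nearest(v):
        return min(levels, key=lambda a: abs(a - v))
    return complex(nearest(z.real), nearest(z.imag))


def bg_error(x_tilde, gamma, k1, k2, levels):
    """Benveniste-Goursat error k1*e_D + k2*e_S*|e_D|, cf. (4.52)."""
    e_d = slicer(x_tilde, levels) - x_tilde          # decision-directed error
    e_s = gamma * csgn(x_tilde) - x_tilde            # Sato error, cf. (4.51)
    return k1 * e_d + k2 * e_s * abs(e_d)


# Hypothetical per-axis levels of a 16-QAM constellation.
LEVELS = (-3, -1, 1, 3)

# Exactly on a constellation point the DD error is zero, so the BG error vanishes
# even though the Sato error alone would not.
AT_SOLUTION = bg_error(complex(1, 3), 2.5, 1.0, 1.0, LEVELS)

# Away from the constellation the Sato term is switched in by |e_D|.
OFF_POINT = complex(0.2, 2.1)
DD_ONLY = bg_error(OFF_POINT, 2.5, 1.0, 0.0, LEVELS)
```

The factor $|e_D(i)|$ is what makes the Sato contribution fade out automatically as the eye opens.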

4.2.5 The Godard and Donoho (or Shalvi-Weinstein) Algorithms

The basic motivation behind the development of Godard's algorithm, introduced in 1980 [22], was
to find a cost function that characterizes the amount of ISI at the equalizer output independently
of the carrier phase. Since the input sequence $x(i)$ is i.i.d., the cost function that satisfies the
aforementioned conditions is

$$ J^{(p)} = E\{(|\tilde{x}(i)|^p - |x(i)|^p)^q\}, \qquad (4.53) $$

which depends on the input sequence. For $p = 2$ and $q = 2$, $J^{(p)}$ takes the form

$$ J^{(2)} = E\{|\tilde{x}(i)|^4 + |x(i)|^4 - 2|\tilde{x}(i)|^2|x(i)|^2\} \qquad (4.54) $$

where we assume that $E\{x^2(i)\} = 0$. However, (4.48) or (4.49) cannot be used in practice because
$\{x(i)\}$ is inaccessible. To avoid this difficulty, Godard [22] suggested the use of a dispersion
function

$$ D^{(p)} = E\{(|\tilde{x}(i)|^p - R_p)^q\} \qquad (4.55) $$

which was shown to behave like the cost function $J^{(p)}$ and yet is independent of the input
sequence. Note that $R_p$ is defined by (4.44). Assuming $p = 2$, $q = 2$, (4.49) and (4.50) can be
written as [22]

$$ J^{(2)} = J_1 + J_2 + \{4(E\{|x(i)|^2\})^2 \cdot |f(0)|^2 - 2(E\{|x(i)|^2\})^2\} \cdot \sum_k |f(k)|^2 \qquad (4.56) $$

and

$$ D^{(2)} = J_1 + J_2 + \{4(E\{|x(i)|^2\})^2 \cdot |f(0)|^2 - 2E\{|x(i)|^4\}\} \cdot \sum_k |f(k)|^2 + R_2^2 - E\{|x(i)|^4\} \qquad (4.57) $$

where $\sum_k$ is taken for $k \neq 0$ and

$$ J_1 = E\{|x(i)|^4\}\,(1 - |f(0)|^2)^2 + E\{|x(i)|^4\} \cdot \sum_k |f(k)|^4, $$

$$ J_2 = 2(E\{|x(i)|^2\})^2 \cdot \Big[\Big(\sum_k |f(k)|^2\Big)^2 - \sum_k |f(k)|^4\Big]. \qquad (4.58) $$

Comparing (4.51) with (4.52), we see that for $D^{(2)}$ to be similar to $J^{(2)}$, the following inequality
must be satisfied:

$$ 4(E\{|x(i)|^2\})^2\,|f(0)|^2 - 2E\{|x(i)|^4\} > 0 $$

or

$$ |f(0)|^2 > \frac{E\{|x(i)|^4\}}{2(E\{|x(i)|^2\})^2}. \qquad (4.59) $$

Godard suggests (4.53) and $f(i) = 0$ for $i \neq 0$ as a way of initializing his algorithm.

Based on what has been reported in the literature [50] and on our simulation experience, the
Godard algorithm has always converged to a minimum that opens the eye when Godard's
initialization procedure is followed. The Godard algorithm is summarized in (4.27) and Fig. 4.6.
Its convergence for $p = 2$ is better than for $p = 1$. In addition, Godard noted that convergence
improves when the step size $\mu$ is divided by 2 every 10,000 iterations [22]. The Constant Modulus
Algorithm (CMA), suggested independently by Treichler and Agee in 1983 [50], is the Godard
algorithm for $p = 2$ and $R_2 = 1$. Ding et al. [15] reported that the Godard-type algorithms
exhibit local (not global) undesirable minima.

Shalvi and Weinstein recently introduced [48] a blind equalization scheme based on the idea of
matching the kurtosis measures between the transmitted sequence $\{x(i)\}$ and the reconstructed
sequence $\{\tilde{x}(i)\}$ at the output of the equalizer. The kurtosis of the input complex sequence $x(i)$
is defined by

$$ K(x(i)) = E\{|x(i)|^4\} - 2E^2\{|x(i)|^2\} - |E\{x^2(i)\}|^2 \qquad (4.60) $$

which is zero for complex Gaussian random variables. The important point made in [48] is that
if $E\{|\tilde{x}(i)|^2\} = E\{|x(i)|^2\}$, then (1) $|K(\tilde{x}(i))| \le |K(x(i))|$ and (2) $|K(\tilde{x}(i))| = |K(x(i))|$ if
perfect equalization is achieved. Thus, the problem is to maximize the magnitude of the kurtosis
measure $|K(\tilde{x}(i))|$ in the output of the equalizer at each iteration subject to the constraint
$E\{|\tilde{x}(i)|^2\} = E\{|x(i)|^2\}$. One of the special cases of the Shalvi-Weinstein algorithm is the original
Godard algorithm. It has recently been reported that the Shalvi-Weinstein algorithm was
originally introduced by Donoho [16] for real-valued signals and that the algorithm's convergence
is only guaranteed for infinite-length equalization filters.
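To make the kurtosis-matching idea concrete, the following minimal Python sketch evaluates (4.60) exactly over an equiprobable 16-QAM alphabet and over a two-tap mixture of independent symbols whose output power matches the input power. The alphabet and the mixing coefficients 0.8 and 0.6 are assumptions chosen for illustration; they satisfy $0.8^2 + 0.6^2 = 1$ so that the power constraint holds.

```python
def kurtosis(samples):
    """K = E{|x|^4} - 2 E^2{|x|^2} - |E{x^2}|^2 over equiprobable samples, cf. (4.60)."""
    n = len(samples)
    m4 = sum(abs(s) ** 4 for s in samples) / n
    m2 = sum(abs(s) ** 2 for s in samples) / n
    c2 = sum(s * s for s in samples) / n
    return m4 - 2 * m2 ** 2 - abs(c2) ** 2


def mean_power(samples):
    """E{|x|^2} over equiprobable samples."""
    return sum(abs(s) ** 2 for s in samples) / len(samples)


# Hypothetical 16-QAM alphabet, assumed equiprobable.
LEVELS = (-3, -1, 1, 3)
QAM16 = [complex(a, b) for a in LEVELS for b in LEVELS]

# Residual ISI modelled as 0.8*x(i) + 0.6*x(i-1) with independent symbols;
# enumerating all symbol pairs gives the exact ensemble averages.
MIXED = [0.8 * a + 0.6 * b for a in QAM16 for b in QAM16]

K_INPUT = kurtosis(QAM16)
K_MIXED = kurtosis(MIXED)
```

Under these assumptions the kurtosis of the mixture is scaled by $0.8^4 + 0.6^4 = 0.5392$, so its magnitude is strictly smaller than that of the input even though the powers are equal, which is property (1) above.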

4.2.6  The  Stop-and-Go  and  Decision-Directed  Algorithms 

The basic idea behind the Stop-and-Go algorithm, which was proposed by Picchi and Prati
[41] in 1987, is to retain the advantages of simplicity and fast convergence (in open eye-pattern
conditions) of the Decision-Directed (DD) algorithm [33] while attempting to improve its blind
convergence capabilities.

The adaptation error $e_D(i)$ used in the DD algorithm is [33]

$$ e_D(i) = \hat{x}(i) - \tilde{x}(i) \qquad (4.61) $$

where $\tilde{x}(i)$ is the output of the equalizer and $\hat{x}(i)$ the output of the threshold detector. Assuming
that the equalizer initial tap setting corresponds to a closed eye-pattern, $e_D(i)$ will be large most
of the time due to the large number of incorrect decisions $\hat{x}(i)$. Consequently, the DD algorithm
cannot converge in closed eye-pattern conditions.

In the Stop-and-Go algorithm, Picchi and Prati proposed the use of the error sequence

$$ e(i) = \tfrac{1}{2}\{A(i)\,e_D(i) + B(i)\,e_D^*(i)\} \qquad (4.62) $$

where

$$ A(i) = I_R(i) + I_I(i), \qquad B(i) = I_R(i) - I_I(i) $$

and

$$ I_R(i) = \begin{cases} 1, & \text{if } \mathrm{sgn}[e_D(i)]_R = \mathrm{sgn}[e_S(i)]_R \\ 0, & \text{otherwise} \end{cases} $$

$$ I_I(i) = \begin{cases} 1, & \text{if } \mathrm{sgn}[e_D(i)]_I = \mathrm{sgn}[e_S(i)]_I \\ 0, & \text{otherwise.} \end{cases} $$

Note that $e_S(i)$ is the Sato error given by (4.46).
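The flag logic of (4.62) is easy to misread, so a minimal Python sketch may help. It computes $(A, B)$ from the signs of the DD and Sato errors and forms the Stop-and-Go error; the function names and the numerical error values are hypothetical. Expanding (4.62) shows that the result is simply $I_R \cdot \mathrm{Re}\,e_D + j\,I_I \cdot \mathrm{Im}\,e_D$, i.e., each component of the DD error is passed ("go") only when its sign agrees with that of the Sato error and is otherwise zeroed ("stop").

```python
def sgn(v):
    """Sign function: -1, 0 or +1."""
    return (v > 0) - (v < 0)


def stop_and_go_flags(e_d, e_s):
    """Return (A, B) = (I_R + I_I, I_R - I_I) from the sign agreement of e_D and e_S."""
    i_r = 1 if sgn(e_d.real) == sgn(e_s.real) else 0
    i_i = 1 if sgn(e_d.imag) == sgn(e_s.imag) else 0
    return i_r + i_i, i_r - i_i


def stop_and_go_error(e_d, e_s):
    """Stop-and-Go error 0.5*(A*e_D + B*conj(e_D)), cf. (4.62)."""
    a, b = stop_and_go_flags(e_d, e_s)
    return 0.5 * (a * e_d + b * e_d.conjugate())


# Hypothetical error values: real parts agree in sign, imaginary parts do not.
E_D = complex(0.4, -0.2)
E_S = complex(1.0, 0.5)
FLAGS = stop_and_go_flags(E_D, E_S)
ERROR = stop_and_go_error(E_D, E_S)
```

In this example only the real part of the DD error survives, and the pair $(A, B)$ equals (1, 1), one of the four admissible values listed for (4.42).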

From the foregoing, it is clear that the Stop-and-Go algorithm is essentially the DD algorithm
when the eye is open. It is mostly during closed eye-pattern conditions that the Stop-and-Go
adaptation rule takes place. Also, it is clear that the Benveniste-Goursat and Stop-and-Go
algorithms have different convergence properties when the eye-pattern is closed and similar
convergence properties when the eye is open. Modifications of this algorithm have been
proposed to incorporate joint equalization and carrier recovery, decision feedback equalization [1],
as well as fractionally spaced equalization [21], [45].

4.3 The CRIMNO Algorithm

Although the Bussgang algorithms are different from each other, as we have seen, they perform
only memoryless nonlinear transformations on the equalizer outputs to generate the desired
response. This, in turn, implies that the cost functions they attempt to minimize with respect
to the equalizer coefficients are also memoryless. These algorithms do not explicitly employ the
fact that the transmitted data are statistically independent, which is the essence of the new
criterion we introduce in this section. Since statistical independence of the transmitted data involves
more than one data symbol, this results in a memory nonlinear transformation on the equalizer
outputs and thus a memory nonlinear cost function.

4.3.1  Criterion  with  Memory  Nonlinearity 

As we have seen, Godard solves the blind equalization problem by proposing a cost function
which is independent of the transmitted data, and yet reaches its global minimum at perfect
equalization. The Godard cost function (also known as the constant modulus algorithm (CMA))
[22] is given by (4.50) and (4.44).

Note that only the expected value of some function of the current equalizer output appears
in Godard's cost function. Therefore, the Godard criterion only makes use of the probability
distribution of the transmitted data. It does not explicitly use the fact that the transmitted data
are statistically independent.

Assume that perfect equalization is achievable and consider the situation where perfect
equalization has indeed been achieved. That is,

$$ \tilde{x}(i) = x(i - D) $$

where $D$ is some positive number, which accounts for the delay. Since the transmitted data
$x(i)$ are statistically independent from each other, so are the equalizer outputs $\tilde{x}(i)$ at perfect
equalization. In addition, for most transmitted data constellations, the mean of the transmitted data
$x(i)$ is zero. Therefore, at perfect equalization, we have, for $l \neq 0$,

$$ E\{\tilde{x}(i)\tilde{x}^*(i - l)\} = E\{x(i - D)x^*(i - l - D)\} = E\{x(i - D)\} \cdot E\{x^*(i - l - D)\} = 0. $$

By making use of this property and combining it with Godard's criterion, we obtain a new
criterion, called the criterion with memory nonlinearity (CRIMNO), which is the minimization of the
following cost function:

$$ M^{(p)} = w_0 E(|\tilde{x}(i)|^p - R_p)^2 + w_1|E\{\tilde{x}(i)\tilde{x}^*(i - 1)\}|^2 + \cdots + w_M|E\{\tilde{x}(i)\tilde{x}^*(i - M)\}|^2. \qquad (4.63) $$

The rationale behind the CRIMNO is that since each term reaches its global minimum at
perfect equalization, by appropriately combining them, we can increase the convergence speed of
the corresponding CRIMNO algorithm [12], [13]. This is clearly demonstrated in the simulations
section.

Remarks: 

1. Memory nonlinearity: the CRIMNO cost function depends not only on the current equalizer
output, but also on the previous equalizer outputs. As such, it results in a criterion with
memory nonlinearity. The parameter $M$ determines the size of the memory.

2. Generalization of the Godard criterion: when $w_0 = 1$, $w_i = 0$ for $i \neq 0$, the CRIMNO
cost function reduces to the Godard cost function. Therefore, the CRIMNO criterion may
be seen as a generalization of the Godard criterion.

3. Constant Modulus Property: the CRIMNO criterion preserves the constant modulus
property inherent in Godard.


4.3.2  CRIMNO  Blind  Equalization  Algorithm 


Define the equalizer coefficient vector $\mathbf{u}(i) = [u_1(i), \ldots, u_N(i)]^T$, and the received signal vector
$\mathbf{y}(i) = [y(i), \ldots, y(i - N + 1)]^T$, where $N$ is the length of the equalizer. Then the equalizer outputs
are

$$ \tilde{x}(i - l) = \mathbf{y}^T(i - l) \cdot \mathbf{u}(i), \quad l = 0, 1, \ldots, M, \qquad (4.64) $$

where superscript $T$ denotes transposition of a vector.

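Differentiating the cost function with respect to the equalizer coefficient vector $\mathbf{u}(i)$, we
obtain [12]

$$ \frac{\partial M^{(2)}}{\partial \mathbf{u}(i)} = 4w_0 E[\mathbf{y}^*(i)\tilde{x}(i)(|\tilde{x}(i)|^2 - R_2)] $$
$$ \quad + 2w_1[E(\mathbf{y}^*(i - 1)\tilde{x}(i))E(\tilde{x}^*(i)\tilde{x}(i - 1)) + E(\mathbf{y}^*(i)\tilde{x}(i - 1))E(\tilde{x}(i)\tilde{x}^*(i - 1))] $$
$$ \quad + \cdots $$
$$ \quad + 2w_M[E(\mathbf{y}^*(i - M)\tilde{x}(i))E(\tilde{x}^*(i)\tilde{x}(i - M)) + E(\mathbf{y}^*(i)\tilde{x}(i - M))E(\tilde{x}(i)\tilde{x}^*(i - M))]. \qquad (4.65) $$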


By using the steepest descent method to search for the minimum point, we obtain

$$ \mathbf{u}(i + 1) = \mathbf{u}(i) - \alpha \cdot \{4w_0 E[\mathbf{y}^*(i)\tilde{x}(i)(|\tilde{x}(i)|^2 - R_2)] $$
$$ \quad + 2w_1[E(\mathbf{y}^*(i - 1)\tilde{x}(i))E(\tilde{x}^*(i)\tilde{x}(i - 1)) + E(\mathbf{y}^*(i)\tilde{x}(i - 1))E(\tilde{x}(i)\tilde{x}^*(i - 1))] $$
$$ \quad + \cdots $$
$$ \quad + 2w_M[E(\mathbf{y}^*(i - M)\tilde{x}(i))E(\tilde{x}^*(i)\tilde{x}(i - M)) + E(\mathbf{y}^*(i)\tilde{x}(i - M))E(\tilde{x}(i)\tilde{x}^*(i - M))]\} \qquad (4.66) $$

where

$$ \mathbf{u}(i) = [u_1(i), \ldots, u_N(i)]^T. $$

In (4.6), the expectations are ensemble averages taken with respect to the transmitted data $x(i)$,
while the channel impulse response $f(i)$ and the equalizer coefficients $\mathbf{u}(i)$ are treated as fixed.

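If we use single point estimates for the ensemble averages, we obtain the stochastic gradient
CRIMNO algorithm:

$$ \mathbf{u}(i + 1) = \mathbf{u}(i) - \alpha[4w_0\mathbf{y}^*(i)\tilde{x}(i)(|\tilde{x}(i)|^2 - R_2) + 2w_1(\mathbf{y}^*(i)\tilde{x}(i)|\tilde{x}(i - 1)|^2 + \mathbf{y}^*(i - 1)\tilde{x}(i - 1)|\tilde{x}(i)|^2) $$
$$ \quad + \cdots + 2w_M(\mathbf{y}^*(i)\tilde{x}(i)|\tilde{x}(i - M)|^2 + \mathbf{y}^*(i - M)\tilde{x}(i - M)|\tilde{x}(i)|^2)] $$
$$ = \mathbf{u}(i) - \alpha[\mathbf{y}^*(i)\tilde{x}(i)(4w_0|\tilde{x}(i)|^2 + 2w_1|\tilde{x}(i - 1)|^2 + \cdots + 2w_M|\tilde{x}(i - M)|^2 - 4w_0R_2) $$
$$ \quad + 2w_1\mathbf{y}^*(i - 1)\tilde{x}(i - 1)|\tilde{x}(i)|^2 + \cdots + 2w_M\mathbf{y}^*(i - M)\tilde{x}(i - M)|\tilde{x}(i)|^2]. \qquad (4.67) $$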
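A single iteration of the stochastic gradient update (4.67) can be sketched as follows. This is a minimal pure-Python illustration in the grouped (second) form of (4.67), with vectors held as lists of complex numbers; the function names, the argument layout (`y_vecs[l]` standing for $\mathbf{y}(i - l)$) and all numerical values are our assumptions. All outputs $\tilde{x}(i - l)$ are recomputed with the current coefficient vector, as in (4.64).

```python
def equalizer_output(y_vec, u):
    """x~ = y^T u for one received vector, cf. (4.64)."""
    return sum(yk * uk for yk, uk in zip(y_vec, u))


def crimno_step(u, y_vecs, weights, r2, alpha):
    """One stochastic-gradient CRIMNO update (p = 2), grouped form of (4.67).

    u       : current equalizer taps, list of N complex numbers
    y_vecs  : y_vecs[l] is the received vector y(i - l), l = 0..M
    weights : [w_0, w_1, ..., w_M]
    """
    m = len(weights) - 1
    x = [equalizer_output(y, u) for y in y_vecs]      # x~(i - l), l = 0..M
    # Scalar that multiplies y*(i) x~(i) in the grouped form.
    scale = 4 * weights[0] * (abs(x[0]) ** 2 - r2)
    scale += sum(2 * weights[l] * abs(x[l]) ** 2 for l in range(1, m + 1))
    grad = [yk.conjugate() * x[0] * scale for yk in y_vecs[0]]
    # Memory terms 2 w_l y*(i - l) x~(i - l) |x~(i)|^2.
    for l in range(1, m + 1):
        coeff = 2 * weights[l] * x[l] * abs(x[0]) ** 2
        grad = [g + yk.conjugate() * coeff for g, yk in zip(grad, y_vecs[l])]
    return [uk - alpha * g for uk, g in zip(u, grad)]


# Hypothetical two-tap example with memory M = 2.
U0 = [complex(1.0, 0.0), complex(0.1, -0.2)]
Y = [
    [complex(1.0, 1.0), complex(0.5, 0.0)],
    [complex(0.5, 0.0), complex(0.2, -0.3)],
    [complex(0.2, -0.3), complex(-0.4, 0.1)],
]
W = [1.0, 0.5, 0.25]
U1 = crimno_step(U0, Y, W, 2.0, 0.01)
# With w_1 = w_2 = 0 the step reduces to a Godard (p = 2) update.
U1_GODARD = crimno_step(U0, Y, [1.0, 0.0, 0.0], 2.0, 0.01)
```

Setting all memory weights to zero recovers the Godard update, in line with Remark 2 of Section 4.3.1.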

Note that at each iteration, all equalizer outputs $\tilde{x}(i - l)$, $l = 0, 1, \ldots, M$, are recalculated using
the current (most recent) equalizer coefficient vector $\mathbf{u}(i)$ via $\tilde{x}(i - l) = \mathbf{y}^T(i - l)\mathbf{u}(i)$. This requires a
lot of computation. If, instead of using the current equalizer coefficient vector $\mathbf{u}(i)$, we use the
delayed equalizer coefficient vector $\mathbf{u}(i - l)$ to calculate $\tilde{x}(i - l)$, then at each iteration we need to
calculate only one equalizer output $\tilde{x}(i)$ using the current equalizer coefficient vector $\mathbf{u}(i)$. Note that
for a small step-size, which is required for the stability of stochastic gradient-type algorithms, the
difference between $\mathbf{u}(i)$ and $\mathbf{u}(i - l)$ is negligible.

4.3.3 Adaptive Weight CRIMNO Algorithm

The shape of the cost function depends on the choice of weights. So does the performance of
the CRIMNO algorithm. Here, we describe an ad hoc way of adjusting the weights on-line in the
blind equalization process.

The basic idea is to estimate the values of all terms in the CRIMNO cost function over a
block of data and then set the weights used in the next block proportional to the deviations of
the corresponding terms from their ideal values at perfect equalization. The rationale behind
this scheme is that if one term in the criterion has a large deviation from its ideal value, then in
the next block the weight associated with it will be set equal to a large value, and consequently,
the gradient-descent method will bring it down quickly.

To elaborate on this idea, we rewrite the CRIMNO cost function as

$$ M^{(p)} = w_0 J_0 + w_1 J_1 + \cdots + w_M J_M, \qquad (4.68) $$

where

$$ J_0 = E(|\tilde{x}(i)|^p - R_p)^2 $$
$$ J_l = |E(\tilde{x}(i)\tilde{x}^*(i - l))|^2, \quad 1 \le l \le M. \qquad (4.69) $$


Define the deviation of the $l$th term, $D(J_l)$, by

$$ D(J_l) = |J_l - J_l^{(0)}| \qquad (4.70) $$

where $J_l^{(0)}$ is the value of $J_l$ at perfect equalization ($J_l^{(0)} = 0$, $l = 1, \ldots, M$). Then the weights
are adjusted using the following formulae:

$$ w_0 = \begin{cases} \lambda_0 D(J_0), & \lambda_0 D(J_0) < A \\ A, & \lambda_0 D(J_0) \ge A \end{cases} \qquad w_l = \begin{cases} \gamma D(J_l), & \gamma D(J_l) < A \\ A, & \gamma D(J_l) \ge A \end{cases} \qquad (4.71) $$

where $\lambda_0 > 0$ is the scaling constant for the first term, $\gamma > 0$ is the scaling constant for the other
terms in the CRIMNO cost function, and $A$ is a constraint on the maximum value of the weights
to guarantee the stability of the algorithm.
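The block-based weight adaptation of (4.69)-(4.71) can be sketched as follows. This is a minimal Python illustration, not the implementation used in our simulations: the time-average estimators of $J_l$, the function names, and the numerical constants are assumptions, and the ideal value of $J_0$ at perfect equalization is supplied by the caller because it depends on the constellation.

```python
def block_terms(x_block, p, r_p, m):
    """Estimate [J_0, J_1, ..., J_M] of (4.69) by time averages over one block."""
    n = len(x_block)
    j0 = sum((abs(x) ** p - r_p) ** 2 for x in x_block) / n
    terms = [j0]
    for l in range(1, m + 1):
        corr = sum(x_block[i] * x_block[i - l].conjugate() for i in range(l, n)) / (n - l)
        terms.append(abs(corr) ** 2)
    return terms


def adapt_weights(j_estimates, j0_ideal, lam0, gamma, cap):
    """Weights for the next block, cf. (4.70)-(4.71); `cap` plays the role of A."""
    weights = [min(lam0 * abs(j_estimates[0] - j0_ideal), cap)]
    for j_l in j_estimates[1:]:
        # The ideal value of J_l is zero for l >= 1, so the deviation is |J_l|.
        weights.append(min(gamma * abs(j_l), cap))
    return weights


# Hypothetical block estimates: J_1 deviates strongly, J_2 hardly at all.
WEIGHTS = adapt_weights([3.0, 0.5, 0.01], 1.0, 0.1, 2.0, 0.6)

# A strongly correlated (alternating) block gives a large J_1 estimate.
TERMS = block_terms([1, -1, 1, -1], 2, 1, 1)
```

In the first example the weight of the strongly deviating term is clipped at the cap, while the weight of the nearly ideal term is close to zero, so the gradient effort is directed at the term that is furthest from its value at perfect equalization.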

The CRIMNO algorithm with weights adjusted in this way is called the adaptive weight CRIMNO
algorithm. Some in-depth comments are provided below:

1. When the deviations of all terms vary proportionally, the adaptive weight scheme
becomes an adaptive step-size algorithm. Moreover, the adaptation is done automatically.
So when the algorithm converges, the weights decrease to zero. Hence, the adaptive
weight CRIMNO algorithm acquires as a byproduct the decreasing step-size, which has
proven to be an optimal strategy for equalization [51].

2. For the adaptive weight CRIMNO algorithm, the shape of the cost function is changing.
The local minima of the cost function are also changing. Thus, what is a local minimum of
the cost function at one iteration may not be at the next iteration. However, whatever
the change of the weights, the global minimum does not change, and it always
corresponds to perfect equalization.

3. The adaptive weight CRIMNO algorithm tends to move out of a local minimum of the cost
function quickly, if the cost function has local minima and the algorithm gets trapped in
one of them. This is based on the following arguments. In the adaptive weight CRIMNO
algorithm, the equalizer coefficient increment, $\Delta\mathbf{u}(i + 1) = \mathbf{u}(i + 1) - \mathbf{u}(i)$, is a random vector,
the variance of which determines how fast the algorithm will move out of a local minimum.
The variance of the equalizer coefficient increment depends on the step-size $\alpha$, the gradient
$\partial J_l / \partial \mathbf{u}$, and the weights $w_l$ (proportional to $D(J_l)$). The step-size and gradient are the same
as in the fixed weight CRIMNO algorithm; we thus concentrate on the third one: $w_l$, or
equivalently $D(J_l)$. At a global minimum of the cost function, the $D(J_l)$ are all small; thus,
the variance of the equalizer coefficient increment is small. Therefore, the algorithm will
remain near the global minimum. However, that is not the case with a local minimum. In
that case, $D(J_l)$ will be large; therefore, the variance of the equalizer coefficient increment
will be large (relative to the case at the global minimum), and the algorithm will move out
of that minimum quickly. Moreover, the larger the deviation, the more quickly the
algorithm will move out of the local minimum.

4. Blocks of data are used to estimate $\{J_l\}$. The block length should be sufficiently long to
make the variances of the estimates small, but not so long that the weight update
falls behind.

4.3.4 CRIMNO Extensions

In this section, the CRIMNO ideas, i.e., memory nonlinearity, are extended to the following cases:
(1) the case of correlated inputs; (2) the case when higher-order correlation terms [38] are utilized.


Colored  CRIMNO 

One of the key assumptions in the CRIMNO criterion is that the transmitted data are independent
and identically distributed (i.i.d.). However, in practice, this may not be true for QAM signals.
Usually, in order to overcome the phase ambiguity caused by the squaring loop for carrier recovery,
differential encoding techniques are used, which correlate the input data when the source symbols
are not equiprobable. Since the operations of differential encoding are known, the autocorrelations
of the input data can be derived. In the case where the autocorrelations of the input data are
known a priori, the CRIMNO criterion can be modified as follows:

$$ M^{(p)} = w_0 E(|\tilde{x}(i)|^p - R_p)^2 + w_1|E(\tilde{x}(i)\tilde{x}^*(i - 1)) - \rho_1|^2 + \cdots + w_M|E(\tilde{x}(i)\tilde{x}^*(i - M)) - \rho_M|^2 \qquad (4.72) $$

where $\rho_l = E(x(i)x^*(i - l))$ are the known autocorrelations of the transmitted data.

Higher-Order Correlation CRIMNO

Here, a criterion which exploits the higher-order correlations, such as the fourth-order statistics
of the equalizer output, is given below:

$$ M^{(p)} = w_0 E(|\tilde{x}(i)|^p - R_p)^2 + \sum_l w_l|E(\tilde{x}(i)\tilde{x}^*(i - l))|^2 + \sum_{j,k,l \text{ all different}} v_{jkl}|E(\tilde{x}(i)\tilde{x}^*(i - j)\tilde{x}(i - k)\tilde{x}^*(i - l))|^2 \qquad (4.73) $$

The performance of both (4.73) and (4.74) criteria needs to be investigated.

4.3.5  Computer  Simulation 

Computer simulations have been conducted to compare the performance of the adaptive weight
CRIMNO algorithm with that of the Godard (or CMA) algorithm. Fig. 4.6 shows the performance
of the adaptive weight CRIMNO algorithm, compared with that of the Godard algorithm
under different step-sizes, including the optimum one. We see that the performance of the
adaptive weight CRIMNO algorithm is better than or approaches that of the Godard algorithm
with optimum step-size. Fig. 4.7 shows the performance of the adaptive weight CRIMNO
algorithm for different memory sizes ($M = 2, 4, 6$). Fig. 4.8 shows the corresponding eye-patterns
at iteration 20000. We see that the larger the memory size $M$, the better the performance of
the adaptive weight CRIMNO algorithm. Table 4.2 lists the computational complexity of the
CRIMNO algorithm, the adaptive weight CRIMNO algorithm, and the Godard algorithm. We
see that the performance improvement is achieved at the expense of only a small increase in
computational complexity.

5  ALGORITHMS  WITH  NONLINEARITY  IN  THE  INPUT 
OF  THE  EQUALIZATION  FILTER 

The  Polyspectra  Based  Techniques 

Another class of blind equalization algorithms comprises those based on higher-order
cumulants or polyspectra [36], such as the tricepstrum equalization algorithm (TEA)
[24], the power cepstrum and tricoherence equalization algorithm (POTEA) [7], and the
cross-tricepstrum equalization algorithm (CTEA) [8]. All these algorithms perform a nonlinear
transformation on the input of the equalization filter. This nonlinear transformation, e.g., the generation
of the higher-order cumulants or polyspectra of the received data, is a memory nonlinear
transformation, because it employs both the present and the past values of the received data. The
use of the higher-order statistics of the received data is necessary for blind equalization, since
the correct phase information about the channel cannot be extracted from only the second-order
statistics of the received data [14], [29], [34], [35], [37], [42].

5.1  Definitions  and  Properties:  Cumulants  and  Higher  Order  Spectra 

The  readers  are  assumed  to  be  somewhat  familiar  with  the  basic  material  of  higher-order  spectra. 
However,  some  important  properties  which  will  be  used  in  the  subsequent  sections  are  given. 

5.1.1  Definitions 

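1. Definition of Cumulants:

Given a set of $n$ real random variables $\{x_1, x_2, \ldots, x_n\}$, their joint cumulant of order $n$
is defined as

$$ L(x_1, x_2, \ldots, x_n) = (-j)^n \left. \frac{\partial^n \ln \Phi(v_1, v_2, \ldots, v_n)}{\partial v_1\,\partial v_2 \cdots \partial v_n} \right|_{v_1 = v_2 = \cdots = v_n = 0} \qquad (5.1) $$

where

$$ \Phi(v_1, v_2, \ldots, v_n) = E\{\exp j(v_1x_1 + \cdots + v_nx_n)\}. \qquad (5.2) $$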


Given a real stationary random sequence $\{x(i)\}$ with zero mean, $E\{x(i)\} = 0$, the
$n$th-order cumulant of the random sequence depends only on the time differences and is
defined as

$$ L_x(\tau_1, \tau_2, \ldots, \tau_{n-1}) = (-j)^n \left. \frac{\partial^n \ln \Phi_x(v_1, v_2, \ldots, v_n)}{\partial v_1\,\partial v_2 \cdots \partial v_n} \right|_{v_1 = v_2 = \cdots = v_n = 0} \qquad (5.3) $$

where $\tau_1, \tau_2, \ldots, \tau_{n-1}$ are integers and

$$ \Phi_x(v_1, v_2, \ldots, v_n) = E\{\exp j(v_1x(i) + v_2x(i + \tau_1) + \cdots + v_nx(i + \tau_{n-1}))\}. \qquad (5.4) $$

Given a set of real jointly stationary random sequences $\{x_k(i)\}$, $k = 1, 2, \ldots, n$, with zero
mean, $E\{x_k(i)\} = 0$, the $n$th-order cross-cumulant of the sequences depends only on
the time differences and is defined as

$$ L_{x,1,2,\ldots,n}(\tau_1, \tau_2, \ldots, \tau_{n-1}) \triangleq (-j)^n \left. \frac{\partial^n \ln \Phi_{x,1,2,\ldots,n}(v_1, v_2, \ldots, v_n)}{\partial v_1\,\partial v_2 \cdots \partial v_n} \right|_{v_1 = v_2 = \cdots = v_n = 0} \qquad (5.5) $$

where $\tau_1, \tau_2, \ldots, \tau_{n-1}$ are integers and

$$ \Phi_{x,1,2,\ldots,n}(v_1, v_2, \ldots, v_n) = E\{\exp j(v_1x_1(i) + v_2x_2(i + \tau_1) + \cdots + v_nx_n(i + \tau_{n-1}))\}. \qquad (5.6) $$

2. Definitions of Higher-Order Spectra.

Higher-order spectra are defined to be the Z-transforms of the corresponding cumulants
[34], [38]. Specifically, an $n$th-order spectrum of a real stationary zero mean random
sequence $\{x(i)\}$ is just the $(n - 1)$-dimensional Z-transform of the $n$th-order cumulant
$L_x(\tau_1, \tau_2, \ldots, \tau_{n-1})$ of the random sequence. That is,

$$ S_x(z_1, z_2, \ldots, z_{n-1}) = \sum_{\tau_1, \tau_2, \ldots, \tau_{n-1}} L_x(\tau_1, \tau_2, \ldots, \tau_{n-1}) \prod_{i=1}^{n-1} z_i^{-\tau_i}. \qquad (5.7) $$

When $n = 2, 3, 4$ the corresponding spectrum is called the power spectrum, bispectrum, and
trispectrum, respectively.

An $n$th-order cross-spectrum of a set of real stationary zero mean random sequences $\{x_k(i)\}$,
$k = 1, 2, \ldots, n$, is defined as the $(n - 1)$-dimensional Z-transform of the $n$th-order cross-cumulant
$L_{x,1,2,\ldots,n}(\tau_1, \tau_2, \ldots, \tau_{n-1})$ of the random sequences, that is,

$$ S_{x,1,2,\ldots,n}(z_1, z_2, \ldots, z_{n-1}) = \sum_{\tau_1, \tau_2, \ldots, \tau_{n-1}} L_{x,1,2,\ldots,n}(\tau_1, \tau_2, \ldots, \tau_{n-1}) \prod_{i=1}^{n-1} z_i^{-\tau_i}. \qquad (5.8) $$


3. Definitions of coherence.

Coherence is defined as the higher-order spectrum normalized by the power spectrum.
Specifically, an $n$th-order coherence of a real stationary zero mean random sequence $x(i)$ is
defined as

$$ R_x(z_1, z_2, \ldots, z_{n-1}) = \frac{S_x(z_1, z_2, \ldots, z_{n-1})}{\left[S_x(z_1)S_x(z_2) \cdots S_x(z_{n-1})\,S_x\!\left(\prod_{i=1}^{n-1} z_i^{-1}\right)\right]^{1/2}}. \qquad (5.9) $$

An alternative definition for the $n$th-order coherence, which is equivalent to the above
definition, is


o  /  \  A  ^x(~l ;  z2i  ~  ‘  ,  ~n— 1  ) 

•^-x( ~l  ?  ~2i  '  '  '  >  zn-l )  —  ,  _i  , 

.‘■’zWl  ?22 


(5.10) 


4. Definitions of Cepstrum of Higher-Order Spectrum

The cepstrum is defined as the inverse Z-transform of the log function of the spectrum.
Specifically, a cepstrum for the $n$th-order spectrum of a real stationary zero mean random
sequence $\{x(i)\}$ is defined as

$$ c_x(\tau_1, \tau_2, \ldots, \tau_{n-1}) = Z^{-1}[\ln S_x(z_1, z_2, \ldots, z_{n-1})]. \qquad (5.11) $$

A cepstrum for the $n$th-order cross-spectrum of a set of real stationary zero mean random
sequences $\{x_k(i)\}$, $k = 1, 2, \ldots, n$, is defined as

$$ c_{x,1,2,\ldots,n}(\tau_1, \tau_2, \ldots, \tau_{n-1}) = Z^{-1}[\ln S_{x,1,2,\ldots,n}(z_1, z_2, \ldots, z_{n-1})]. \qquad (5.12) $$

When $n = 2, 3, 4$, the corresponding cepstrum is called the power cepstrum, bicepstrum and
tricepstrum, respectively.

5.1.2  Properties 

Some  important  properties  of  cumulants  are  shown  below. 

1. If $x_1, x_2, \ldots, x_n$ can be divided into two or more groups which are statistically independent, then the cumulant $L(x_1, x_2, \ldots, x_n)$ is zero.

Specifically, if $\{x(i)\}$ are independent, identically distributed random variables, the nth-order cumulant of the sequence $\{x(i)\}$ is

\[ L_x(\tau_1, \tau_2, \ldots, \tau_{n-1}) = \gamma_x \, \delta(\tau_1) \delta(\tau_2) \cdots \delta(\tau_{n-1}) \tag{5.13} \]

2. Cumulants of higher order ($n \geq 3$) are zero for Gaussian processes.

3. If $\{x(i)\}$ and $\{y(i)\}$ are statistically independent random sequences and $z(i) = x(i) + y(i)$, then

\[ L_z(\tau_1, \tau_2, \ldots, \tau_{n-1}) = L_x(\tau_1, \tau_2, \ldots, \tau_{n-1}) + L_y(\tau_1, \tau_2, \ldots, \tau_{n-1}). \tag{5.14} \]
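The properties above can be illustrated with a sample estimate of the fourth-order cumulant. The following is a minimal sketch, assuming a real, zero-mean sequence and the standard moment-to-cumulant relation for zero-mean data; the function name `fourth_order_cumulant` and the plain time-averaging estimator are illustrative choices, not taken from the text.

```python
import numpy as np

def fourth_order_cumulant(y, m, n, l):
    """Sample estimate of L_y(m, n, l) for a real, zero-mean sequence y.

    Assumes L = M4(m,n,l) - R(m)R(n-l) - R(n)R(l-m) - R(l)R(m-n).
    """
    y = np.asarray(y, dtype=float)
    N = len(y)
    # indices i for which i, i+m, i+n, i+l all fall inside the record
    lo = max(0, -m, -n, -l)
    hi = min(N, N - m, N - n, N - l)
    idx = np.arange(lo, hi)

    def r(k):
        # sample autocorrelation E{y(i) y(i+k)}; symmetric in k for real data
        k = abs(k)
        return np.mean(y[:N - k] * y[k:])

    m4 = np.mean(y[idx] * y[idx + m] * y[idx + n] * y[idx + l])
    return m4 - r(m) * r(n - l) - r(n) * r(l - m) - r(l) * r(m - n)

# Usage: a Gaussian record gives a value near zero (property 2), while a
# binary +/-1 record gives a clearly non-zero value at the origin.
rng = np.random.default_rng(0)
gauss = rng.standard_normal(100_000)
binary = np.array([1.0, -1.0] * 8)
print(fourth_order_cumulant(gauss, 0, 0, 0))
print(fourth_order_cumulant(binary, 0, 0, 0))
```

For finite records the Gaussian estimate is only approximately zero; the deviation shrinks as the record length grows.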

5.2  Tricepstrum  Equalization  Algorithm  (TEA) 

5.2.1  Problem  Formulations 

We assume that the received sequence, after being demodulated, low-pass filtered and synchronously sampled (at rate $1/T$), can be written as:

\[ y(i) = z(i) + w(i) = \sum_{k=-L_1}^{L_2} f(k) x(i-k) + w(i) \tag{5.15} \]

where the nonminimum phase equivalent channel impulse response $\{f(i)\}$ accounts for the transmitter filter, non-ideal channel (or multipath propagation), and receiver filter impulse response; the input data sequence $\{x(i)\}$ is generally complex, non-Gaussian, white, i.i.d., with $E\{x(i)\} = 0$, $E\{x(i)^3\} = 0$ and $E\{x(i)^4\} - 3[E\{x(i)^2\}]^2 \neq 0$; for example $\{x(i)\}$ could be a multi-level symmetric PAM sequence or the complex baseband equivalent sequence of a symmetric QAM signal; the additive noise $\{w(i)\}$ is zero-mean, Gaussian, generally complex and statistically independent of $\{x(i)\}$; we also assume that the channel transfer function $F(z)$ (Z-transform of $\{f(i)\}$) admits the factorization [24]

\[ F(z) = A \cdot I(z^{-1}) \cdot O(z) \tag{5.16} \]

The factor $I(z^{-1}) = \prod_{k=1}^{L_1} (1 - a_k z^{-1}) / \prod_{k} (1 - c_k z^{-1})$, $|a_k| < 1$, $|c_k| < 1$, is a minimum phase polynomial, i.e., with zeros and poles inside the unit circle. The factor $O(z) = \prod_{k=1}^{L_2} (1 - b_k z)$, $|b_k| < 1$, is a maximum phase polynomial, i.e., with zeros outside the unit circle. The parameter $A$ is a constant gain factor. Finally, the sequence $\{y(i)\}$ is the input to the blind equalizer.
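To make the factorization concrete, the sketch below builds the impulse response of an FIR channel (no poles) from chosen minimum and maximum phase zeros. The function name `channel_impulse_response` and the indexing convention (taps returned for time indices from minus the number of maximum phase zeros up to the number of minimum phase zeros) are assumptions of this example, read off the factors $(1 - a_k z^{-1})$ and $(1 - b_k z)$.

```python
import numpy as np

def channel_impulse_response(a_zeros, b_zeros, gain=1.0):
    """Taps of F(z) = A * I(z^-1) * O(z) for an FIR channel.

    Returns (f, k): f[j] is the coefficient of z^{-k[j]}.
    """
    i_part = np.array([1.0 + 0j])
    for a in a_zeros:
        # (1 - a z^-1): causal factor, taps at time indices 0 and 1
        i_part = np.convolve(i_part, [1.0, -a])
    o_part = np.array([1.0 + 0j])
    for b in b_zeros:
        # (1 - b z): anticausal factor, taps at time indices -1 and 0
        o_part = np.convolve(o_part, [-b, 1.0])
    f = gain * np.convolve(o_part, i_part)
    k = np.arange(-len(b_zeros), len(a_zeros) + 1)
    return f, k

# Usage: one zero inside the unit circle (0.5) and one outside (1/0.25 = 4).
f, k = channel_impulse_response([0.5], [0.25], gain=2.0)
print(dict(zip(k.tolist(), f.real.tolist())))
```

A received sequence can then be simulated by convolving an i.i.d. symbol sequence with `f` and adding Gaussian noise.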

5.2.2  Relations  of  Tricepstrum  of  the  Linear  Filter  Output 

The input to the channel, $x(i)$, is a non-Gaussian i.i.d. random sequence, thus

\[ S_x(z_1, z_2, z_3) = \gamma_x. \tag{5.17} \]

The trispectrum of the output, $y(i)$, of the channel (linear filter) is

\[ \begin{aligned} S_y(z_1, z_2, z_3) &= \gamma_x F(z_1) F(z_2) F(z_3) F(z_1^{-1} z_2^{-1} z_3^{-1}) \\ &= \gamma_x A^4 \cdot I(z_1^{-1}) I(z_2^{-1}) I(z_3^{-1}) I(z_1 z_2 z_3) \cdot O(z_1) O(z_2) O(z_3) O(z_1^{-1} z_2^{-1} z_3^{-1}) \end{aligned} \tag{5.18} \]

Taking the logarithm of $S_y(z_1, z_2, z_3)$ and then the inverse Z-transform, after some manipulation, we obtain [24]

\[ c_y(m, n, l) = \begin{cases} \log(\gamma_x A^4), & m = n = l = 0 \\ -\frac{1}{m} A^{(m)}, & m > 0,\ n = l = 0 \\ -\frac{1}{n} A^{(n)}, & n > 0,\ m = l = 0 \\ -\frac{1}{l} A^{(l)}, & l > 0,\ m = n = 0 \\ \frac{1}{m} B^{(-m)}, & m < 0,\ n = l = 0 \\ \frac{1}{n} B^{(-n)}, & n < 0,\ m = l = 0 \\ \frac{1}{l} B^{(-l)}, & l < 0,\ m = n = 0 \\ -\frac{1}{n} B^{(n)}, & m = n = l > 0 \\ \frac{1}{n} A^{(-n)}, & m = n = l < 0 \\ 0, & \text{otherwise} \end{cases} \tag{5.19} \]

where $A^{(m)}$ and $B^{(m)}$ are the minimum and maximum phase differential cepstrum parameters of the system, corresponding to $I(z^{-1})$ and $O(z)$, respectively. They are defined as follows:

\[ A^{(m)} \overset{\text{def}}{=} \sum_{k=1}^{L_1} a_k^m - \sum_{k} c_k^m, \qquad B^{(m)} \overset{\text{def}}{=} \sum_{k=1}^{L_2} b_k^m \tag{5.20} \]


In addition, the following identity holds between the fourth-order cumulants $L_y(m, n, l)$ and the tricepstrum $c_y(m, n, l)$:

\[ \sum_{J=1}^{\infty} \left\{ A^{(J)} \left[ L_y(m-J, n, l) - L_y(m+J, n+J, l+J) \right] \right\} + \sum_{J=1}^{\infty} \left\{ B^{(J)} \left[ L_y(m-J, n-J, l-J) - L_y(m+J, n, l) \right] \right\} = -m \cdot L_y(m, n, l) \tag{5.21} \]

where we define

\[ J \cdot c_y(J, 0, 0) = \begin{cases} -A^{(J)}, & J = 1, \ldots, \infty \\ B^{(-J)}, & J = -1, \ldots, -\infty \end{cases} \]

$A^{(J)}$ and $B^{(J)}$, $J = 1, 2, \ldots$, are the minimum and maximum phase cepstral coefficients respectively, which are related to the zeros of $F(z)$. In practice, the summations in (5.21) can be truncated to arbitrarily large but finite numbers of terms because $A^{(J)}$ and $B^{(J)}$ decay exponentially as $J$ increases.
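The exponential decay of the cepstral coefficients is easy to see numerically. The sketch below assumes an FIR channel (no poles), so that each coefficient is simply a power sum of the zeros; the helper name `cepstral_coefficients` is hypothetical.

```python
import numpy as np

def cepstral_coefficients(zeros, order):
    """Return [C^(1), ..., C^(order)] with C^(J) = sum_k zeros[k]**J.

    Pass the a_k to get the A^(J), or the b_k to get the B^(J).
    """
    zeros = np.asarray(zeros, dtype=complex)
    J = np.arange(1, order + 1)
    return np.sum(zeros[None, :] ** J[:, None], axis=1)

# Usage: two minimum phase zeros; |A^(J)| is bounded by 2 * 0.5**J.
A = cepstral_coefficients([0.5, -0.25], 10)
for J, value in enumerate(A, start=1):
    print(J, abs(value), 2 * 0.5 ** J)
```

Since every zero parameter has modulus below one, the magnitude of the J-th coefficient is bounded by the number of zeros times the J-th power of the largest modulus, which is what justifies truncating the sums.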

In practice the fourth-order cumulants $L_y(\cdot)$ in (5.21) need to be substituted by their estimates $\hat{L}_y(\cdot)$ obtained from a finite length window of the received samples $\{y(i)\}$.

The TEA algorithm uses (5.21) to form an overdetermined system of equations, i.e., a system with more equations than unknowns. TEA then solves this overdetermined system adaptively, using an LMS adaptation algorithm. At each iteration an estimate of the cepstral parameters $\{A^{(J)}\}$ and $\{B^{(J)}\}$ is computed. The coefficients of the equalizer are calculated from $\{A^{(J)}\}$ and $\{B^{(J)}\}$ by means of iterative formulas.

5.2.3  TEA  Algorithm 

Let:

$\{y(i)\}$: The received zero-mean synchronously sampled communication signal.

$N_1, N_2$: Lengths of minimum and maximum phase components of the equalizer.

$p, q$: Lengths of minimum and maximum phase cepstral parameters.

$\hat{M}_y^{(i)}(m, n, l)$: Estimated fourth-order moments of $\{y(i)\}$ at iteration (i).

$\hat{R}_y^{(i)}(j)$: Estimated second-order moments of $\{y(i)\}$ at iteration (i).

$\hat{L}_y^{(i)}(m, n, l)$: Estimated fourth-order cumulants of $\{y(i)\}$ at iteration (i).

Symmetric PAM or QAM Signaling:
In general, for 1-D (e.g. PAM) or 2-D (e.g. QAM) signaling with symmetric constellations:

\[ \hat{L}_y^{(i)}(m, n, l) = \hat{M}_y^{(i)}(m, n, l) - \hat{R}_y^{(i)}(m) \hat{R}_y^{(i)}(n-l) - \hat{R}_y^{(i)}(n) \hat{R}_y^{(i)}(l-m) - \hat{R}_y^{(i)}(l) \hat{R}_y^{(i)}(m-n) \tag{5.22} \]

For symmetric square ($L \times L$) QAM constellations:

\[ \hat{L}_y^{(i)}(m, n, l) = \hat{M}_y^{(i)}(m, n, l) \tag{5.23} \]

$\hat{A}^{(J)}(i)$ and $\hat{B}^{(J)}(i)$ are the minimum and maximum phase differential cepstrum parameters at iteration (i), respectively. $L_1$ and $L_2$ are the orders of the minimum phase and maximum phase components of the FIR channel, respectively. Note that $\{a_i\}$, $|a_i| < 1$, and $\{b_i\}$, $|b_i| < 1$, are the zeros of the minimum and maximum phase components of the FIR channel, respectively.

$\{u(i)\}$: The coefficients of the equalizer at iteration (i).

{£•(£)}:  The  coefficients  of  the  equalizer  at  iteration  (i). 

At  iteration  (i):  i  =  1,2,... 

Step 1 Estimate adaptively the $\hat{L}_y^{(i)}(m, n, l)$, $-M \leq m, n, l \leq M$, from a finite length window of $\{y(i)\}$ as described below. $M$ should be sufficiently large so that $L_y(m, n, l) \approx 0$ for $|m|, |n|, |l| > M$. Assuming that at iteration (0) we have received the time samples $\{y(1), \ldots, y(l_{lag})\}$, we proceed as follows:

Stationary Case with Growing Rectangular Window

\[ \hat{M}_y^{(i)}(m, n, l) = (1 - \eta(i)) \cdot \hat{M}_y^{(i-1)}(m, n, l) + \eta(i) \cdot y(S_1) y(S_1 + m) y(S_1 + n) y(S_1 + l) \tag{5.24} \]

\[ \hat{R}_y^{(i)}(j) = (1 - \eta(i)) \cdot \hat{R}_y^{(i-1)}(j) + \eta(i) \cdot y(S_2) y(S_2 + j) \tag{5.25} \]

where $\eta(i) = \frac{1}{i + l_{lag}}$, $S_1 = \min(i + l_{lag},\ i + l_{lag} - m,\ i + l_{lag} - n,\ i + l_{lag} - l)$, and $S_2 = \min(i + l_{lag},\ i + l_{lag} - j)$. Finally substitute (5.24) and (5.25) into (5.22) or (5.23).
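The growing-window update can be sketched as follows. The names `update_moment` and `window_index` are hypothetical; the sketch assumes the step size $\eta(i) = 1/(i + l_{lag})$ and the index rule $S = \min(i + l_{lag}, i + l_{lag} - \text{lag}, \ldots)$, which keeps every shifted sample inside the data received so far. With that step size, and an initial value equal to the average of the first $l_{lag}$ products, the recursion reproduces the ordinary running mean.

```python
def update_moment(previous, new_product, i, l_lag):
    """One step of M(i) = (1 - eta) * M(i-1) + eta * new_product."""
    eta = 1.0 / (i + l_lag)
    return (1.0 - eta) * previous + eta * new_product

def window_index(i, l_lag, *lags):
    """Largest base index S such that S and S + lag are all <= i + l_lag."""
    newest = i + l_lag
    return min([newest] + [newest - lag for lag in lags])

# Usage: six products, the first two (l_lag = 2) received before iteration 1.
products = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
l_lag = 2
estimate = sum(products[:l_lag]) / l_lag
for i, prod in enumerate(products[l_lag:], start=1):
    estimate = update_moment(estimate, prod, i, l_lag)
print(estimate)  # agrees with the plain average of all six products
```

Replacing `eta` by a small constant gives the exponentially weighted variant used for nonstationary channels.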
Nonstationary Case
First Way:

For $i < h$ use (5.24), (5.25) with $\eta(i) = \frac{1}{i + l_{lag}}$; for $i > h$ use (5.24), (5.25) with

\[ \eta(i) = \eta \ \text{(fixed)}. \tag{5.26} \]

$\eta$ should have a small value ($0 < \eta < 1$), for example $\eta = 0.01$.

Second Way: (for symmetric $L^2$-QAM signaling)

Since in this case the second-order moment $R_y(j) = 0$, we can use $\hat{M}_y(m, n, l)$ with a forgetting factor $w$, $0 < w < 1$, as follows ($S_1$ is as before):

\[ (i + l_{lag}) \cdot \hat{M}_y^{(i)}(m, n, l) = w \cdot (i - 1 + l_{lag}) \cdot \hat{M}_y^{(i-1)}(m, n, l) + y(S_1) y(S_1 + m) y(S_1 + n) y(S_1 + l) \tag{5.27} \]

and substitute $(i + l_{lag}) \cdot \hat{M}_y^{(i)}(m, n, l)$ for $\hat{L}_y^{(i)}(m, n, l)$ everywhere.

Third Way:

Formulas (5.24) and (5.25) could be used in nonstationary environments by reinitializing the algorithm after a certain number of iterations or when a channel change is detected.

Remarks: 

•  By  using  the  symmetry  properties  of  fourth-order  cumulants  only  v—  2  - —  cumulants  need 
to  be  calculated. 

• The assumption that $l_{lag}$ data have been received at iteration (0) avoids ill conditioning of the matrices of the system given in Step 3. It causes a delay of $l_{lag}$ at the input of the equalizer.

Step  2 

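Select $p, q$ arbitrarily large so that $A^{(I)} \approx 0$ and $B^{(J)} \approx 0$ for $I > p$ and $J > q$. For example ($\epsilon = 10^{-4}$, a very small constant):

\[ \begin{aligned} A^{(I)} &\approx 0 \quad \text{for} \quad I > p = \mathrm{int}\left[ \frac{\ln \epsilon}{\ln \alpha} \right] \\ B^{(J)} &\approx 0 \quad \text{for} \quad J > q = \mathrm{int}\left[ \frac{\ln \epsilon}{\ln \beta} \right] \end{aligned} \tag{5.28} \]

where int denotes integer part and $\max|a_i| < \alpha < 1$, $\max|b_i| < \beta < 1$.
Define: $w = \max(p, q)$, $z \leq w/2$, $s \leq z$.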
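As a worked example of this step, the sketch below computes the truncation lengths under the assumption that they follow the integer-part rule $p = \mathrm{int}[\ln\epsilon / \ln\alpha]$ and $q = \mathrm{int}[\ln\epsilon / \ln\beta]$, where $\alpha$ and $\beta$ bound the moduli of the channel zeros. The helper name `truncation_length` and the choice $z = \lfloor w/2 \rfloor$, $s = z$ are illustrative assumptions.

```python
import math

def truncation_length(radius_bound, eps=1e-4):
    """Integer part of ln(eps) / ln(radius_bound), for 0 < radius_bound < 1."""
    return int(math.log(eps) / math.log(radius_bound))

# Usage: zeros bounded by alpha = 0.5 (minimum phase) and beta = 0.9 (maximum phase).
p = truncation_length(0.5)
q = truncation_length(0.9)
w = max(p, q)
z = w // 2   # one admissible choice satisfying z <= w/2
s = z        # one admissible choice satisfying s <= z
print(p, q, w, z, s)
```

Zeros close to the unit circle (bound near one) push the truncation length up sharply, which in turn enlarges the system of equations formed in the next step.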

Step  3 

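Using the relation:

\[ \sum_{I=1}^{p} \left\{ \hat{A}^{(I)}(i) \left[ \hat{L}_y^{(i)}(m-I, n, l) - \hat{L}_y^{(i)}(m+I, n+I, l+I) \right] \right\} + \sum_{J=1}^{q} \left\{ \hat{B}^{(J)}(i) \left[ \hat{L}_y^{(i)}(m-J, n-J, l-J) - \hat{L}_y^{(i)}(m+J, n, l) \right] \right\} = -m \cdot \hat{L}_y^{(i)}(m, n, l) \tag{5.29} \]

with $m = -w, \ldots, -1, 1, \ldots, w$, $n = -z, \ldots, 0, \ldots, z$ and $l = -s, \ldots, 0, \ldots, s$, form the overdetermined system of equations:

\[ P(i) \cdot a(i) = p(i), \quad i = 0, 1, 2, \ldots \tag{5.30} \]

where $P(i)$ is an $N_p \times (p + q)$ matrix (where $N_p = 2w \times (2z + 1) \times (2s + 1)$) with entries of the form $\{\hat{L}_y^{(i)}(m, n, l) - \hat{L}_y^{(i)}(\sigma, \tau, \lambda)\}$; $a(i) = [\hat{A}^{(1)}(i), \ldots, \hat{A}^{(p)}(i), \hat{B}^{(1)}(i), \ldots, \hat{B}^{(q)}(i)]^T$ ($T$ denotes transpose) is the $(p + q) \times 1$ vector of unknown cepstral parameters; $p(i)$ is the $N_p \times 1$ vector with entries of the form $\{-m \cdot \hat{L}_y^{(i)}(m, n, l)\}$.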


Step  4 

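Assume that initially $a(0) = [0, \ldots, 0]^T$. Update $a(i) = [\hat{A}^{(1)}(i), \ldots, \hat{A}^{(p)}(i), \hat{B}^{(1)}(i), \ldots, \hat{B}^{(q)}(i)]^T$ as follows:

\[ a(i+1) = a(i) + \mu(i) \cdot P^H(i) \cdot e(i), \tag{5.31} \]

\[ e(i) = p(i) - P(i) \cdot a(i), \qquad 0 < \mu(i) < 2 / \mathrm{tr}\{P^H(i) \cdot P(i)\} \tag{5.32} \]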
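The update of this step can be sketched as a single function. The name `lms_step` and the `mu_scale` parameter (a fraction of the upper bound $2/\mathrm{tr}\{P^H P\}$) are assumptions of this example. For simplicity the demonstration keeps the matrix and right-hand side fixed, whereas in the algorithm they are rebuilt from updated cumulant estimates at every iteration.

```python
import numpy as np

def lms_step(a, P, p, mu_scale=0.5):
    """One update a <- a + mu * P^H * e with e = p - P a.

    mu = mu_scale * 2 / tr(P^H P), so 0 < mu_scale < 1 respects the bound.
    """
    e = p - P @ a
    mu = mu_scale * 2.0 / np.trace(P.conj().T @ P).real
    return a + mu * (P.conj().T @ e), e

# Usage: a small consistent overdetermined system (3 equations, 2 unknowns).
P = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
a_true = np.array([1.0, -2.0])
p = P @ a_true
a = np.zeros(2)
for _ in range(200):
    a, e = lms_step(a, P, p)
print(a, np.linalg.norm(e))
```

With a fixed consistent system the iteration behaves like gradient descent on the squared error and converges to the least-squares solution.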

Step  5 

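Calculate the equalizer normalized coefficients as follows. Initialize $\hat{i}_{inv}(i, 0) = \hat{o}_{inv}(i, 0) = 1$ and estimate:

\[ \hat{i}_{inv}(i, k) = -\frac{1}{k} \sum_{n=2}^{k+1} \left[ -\hat{A}^{(n-1)}(i) \right] \cdot \hat{i}_{inv}(i, k-n+1), \qquad k = 1, \ldots, N_1 \tag{5.33} \]

\[ \hat{o}_{inv}(i, k) = \frac{1}{k} \sum_{n=k+1}^{0} \left[ -\hat{B}^{(1-n)}(i) \right] \cdot \hat{o}_{inv}(i, k-n+1), \qquad k = -1, \ldots, -N_2 \tag{5.34} \]

where (i) is the iteration index taking values $i = 1, 2, 3, \ldots$ Then,

\[ u_{norm}(i, k) = \hat{i}_{inv}(i, k) * \hat{o}_{inv}(i, k), \qquad k = -N_2, \ldots, 0, \ldots, N_1 \tag{5.35} \]

where $\{*\}$ denotes linear convolution.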
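The recursions of this step can be checked on a channel with known zeros. The sketch below assumes the recursions in the form $i_{inv}(k) = \frac{1}{k}\sum_{n=2}^{k+1} A^{(n-1)} i_{inv}(k-n+1)$ for $k > 0$ and $o_{inv}(k) = \frac{1}{k}\sum_{n=k+1}^{0} [-B^{(1-n)}]\, o_{inv}(k-n+1)$ for $k < 0$, both started from 1 at $k = 0$; function names and array layouts are choices made for this example.

```python
import numpy as np

def inverse_min_phase(A, n_taps):
    """i_inv(0..n_taps) from A[j-1] = A^(j), via the causal recursion."""
    i_inv = np.zeros(n_taps + 1, dtype=complex)
    i_inv[0] = 1.0
    for k in range(1, n_taps + 1):
        acc = sum(A[n - 2] * i_inv[k - n + 1] for n in range(2, k + 2))
        i_inv[k] = acc / k
    return i_inv

def inverse_max_phase(B, n_taps):
    """o_inv(0), o_inv(-1), ..., o_inv(-n_taps), stored as out[j] = o_inv(-j)."""
    o_inv = np.zeros(n_taps + 1, dtype=complex)
    o_inv[0] = 1.0
    for j in range(1, n_taps + 1):
        k = -j
        # n runs over k+1, ..., 0; B^(1-n) is B[-n]; o_inv(k-n+1) is out[n-k-1]
        acc = sum(-B[-n] * o_inv[n - k - 1] for n in range(k + 1, 1))
        o_inv[j] = acc / k
    return o_inv

def normalized_equalizer(A, B, n1, n2):
    """u_norm(k), k = -n2..n1, as the convolution of the two inverse parts."""
    i_inv = inverse_min_phase(A, n1)
    o_inv = inverse_max_phase(B, n2)
    return np.convolve(o_inv[::-1], i_inv)

# Usage: channel (1 - 0.5 z^-1)(1 - 0.25 z); channel * equalizer ~ unit impulse.
a, b = 0.5, 0.25
A = [a ** j for j in range(1, 21)]
B = [b ** j for j in range(1, 21)]
u = normalized_equalizer(A, B, 20, 20)
f = np.convolve([-b, 1.0], [1.0, -a])
print(np.round(np.convolve(f, u).real, 4))
```

The residual away from the central tap comes only from truncating the two inverse responses to a finite number of taps.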

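Step 6

Estimate the gain factor $\hat{A}(i)$ as follows. In Step 1 we have already calculated:

\[ \hat{R}_y^{(i)}(0) \approx Q_x \cdot \sum_k (f(k))^2, \qquad \hat{L}_y^{(i)}(0, 0, 0) \approx \gamma_x \cdot \sum_k (f(k))^4 \tag{5.36} \]

where $Q_x = E\{[x(k)]^2\}$ and $\gamma_x = E\{(x(k))^4\} - 3 \cdot Q_x^2$ are known. Also:

\[ \begin{aligned} \hat{i}(i, k) &= -\frac{1}{k} \sum_{n=2}^{k+1} \hat{A}^{(n-1)}(i) \cdot \hat{i}(i, k-n+1), \qquad k = 1, \ldots, p \\ \hat{o}(i, k) &= \frac{1}{k} \sum_{n=k+1}^{0} \hat{B}^{(1-n)}(i) \cdot \hat{o}(i, k-n+1), \qquad k = -1, \ldots, -q \end{aligned} \tag{5.37} \]

and $\hat{f}(i, k) = \hat{i}(i, k) * \hat{o}(i, k)$, where $\{*\}$ denotes convolution, $Q_f(i) = \sum_k (\hat{f}(i, k))^2$, $\gamma_f(i) = \sum_k (\hat{f}(i, k))^4$. Then (the sign of $\hat{A}(i)$ cannot be identified):

For L-PAM Signaling:

\[ \frac{1}{\hat{A}(i)} = \left( \frac{Q_x \cdot Q_f(i)}{\hat{R}_y^{(i)}(0)} \right)^{1/2} \tag{5.38} \]

For $L^2$-QAM Signaling:

\[ \frac{1}{\hat{A}(i)} = \left( \frac{\gamma_x \cdot \gamma_f(i)}{\hat{L}_y^{(i)}(0, 0, 0)} \right)^{1/4} = \left( \frac{|\gamma_x| \cdot \gamma_f(i)}{|\hat{L}_y^{(i)}(0, 0, 0)|} \right)^{1/4} \tag{5.39} \]

since $\gamma_x < 0$ for equi-probable $L^2$-QAM signaling.

Step 7

Let $y(i) = [y(i + N_2), \ldots, y(i - N_1)]^T$ and $[u_{norm}(i)] = [u_{norm}(i, -N_2), \ldots, u_{norm}(i, N_1)]^T$. Finally, the output of the TEA equalizer is:

\[ \hat{x}(i) = \frac{1}{\hat{A}(i)} \cdot [u_{norm}(i)]^T \cdot y(i) \tag{5.40} \]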

Most of the Bussgang blind equalization algorithms, which are based on non-MSE cost function minimization, have not been shown to be globally convergent, and cases of their misconvergence have been encountered. The TEA algorithm, designed as described above, is a more reliable alternative, as it guarantees convergence.

Remarks: 


1.  Since  Gaussian  noise  is  suppressed  in  the  fourth-order  cumulant  domain,  the  identification 
of  the  channel  response  does  not  take  into  account  the  observation  noise.  Consequently, 
the  proposed  equalizers  work  under  the  zero-forcing  (ZF)  constraint.  For  the  same  reason, 
we  expect  that  the  identification  of  the  channel  will  be  satisfactory  even  in  low  signal  to 
noise  (SNR)  conditions. 

2.  The  ability  of  the  tricepstrum  method  to  identify  separately  the  maximum  and  minimum 
phase  components  of  the  channel  makes  possible  the  design  and  implementation  of  different 
equalization  structures. 

3. In the recursive formulas (5.37) we used the following properties that relate time impulse responses with cepstrum coefficients: (i) a channel and its inverse have cepstrum coefficients of opposite sign, (ii) the cepstrum coefficients of the convolution of two minimum phase or two maximum phase sequences are equal to the sum of the corresponding cepstrum coefficients of the individual sequences, and (iii) two finite impulse response (FIR) sequences with conjugate roots also have conjugate cepstrum coefficients. These become unique features of the TEA equalizer when it is compared with other equalization schemes.

4. The described algorithm is based only on the statistics of the received sequence $\{y(i)\}$ and does not take into account the decisions at the output of the equalizer. Consequently, wrong decisions (and thus error propagation effects) do not affect the convergence of the proposed equalization schemes.

5.  Instead  of  using  the  LMS  algorithm  to  solve  adaptively  the  system  of  equations  (5.30), 
one  may  employ  a  Recursive  Least-Squares  (RLS)  algorithm  [25]  which  will  have  a  faster 
convergence  at  the  expense  of  even  more  computations. 

5.2.4  Power  Cepstrum  and  Tricoherence  Equalization  Algorithm  (POTEA)  [7] 

5.2.5  Relations  of  Power  Cepstrum  and  Tricoherence  of  the  Linear  Filter  Output 

The problem is as formulated in Section 5.2.1: the channel output $y(i)$ is the convolution of the non-Gaussian i.i.d. random sequence $x(i)$ with the channel impulse response $f(i)$, plus some noise. The cepstrum of the power spectrum of the channel output $y(i)$ can be shown, after some algebra, to be equal to [7]

\[ c_{P_y}(m) = \begin{cases} \ln|A^2|, & m = 0 \\ -\frac{1}{m}\left[ A^{(m)} + B^{*(m)} \right], & m > 0 \\ \frac{1}{m}\left[ A^{(-m)} + B^{*(-m)} \right]^{*}, & m < 0 \\ 0, & \text{otherwise} \end{cases} \tag{5.41} \]
where $A^{(k)}$, $B^{(k)}$ are the minimum and maximum phase cepstral coefficients of $F(z)$. These are:

\[ A^{(k)} = \sum_{i=1}^{L_1} a_i^k, \qquad B^{(k)} = \sum_{i=1}^{L_2} b_i^k \tag{5.42} \]

where $\{a_i\}$ and $\{b_i\}$ are the zeros of $F(z)$ inside and outside of the unit circle respectively.

Remarks: 


1. $A^{(k)}$, $B^{(k)}$ decay exponentially and thus their length can be truncated in practice at $k = p$, so that $A^{(p)}$, $B^{(p)}$ are arbitrarily small.

2. If the channel $F(z)$ has cepstral coefficients $A^{(k)}$, $B^{(k)}$, its inverse filter, $F^{-1}(z)$, has cepstral coefficients $-A^{(k)}$, $-B^{(k)}$. It is also shown in [7] that, if we define $S^{(k)} = A^{(k)} + B^{*(k)}$ and $r_y(k) = E\{y(i + k) y^{*}(i)\}$, then the following relation holds:

\[ \sum_{k=1}^{p} S^{(k)} \left[ -r_y(m - k) \right] + \sum_{k=1}^{p} S^{*(k)} \left[ r_y(m + k) \right] = m \, r_y(m), \qquad m = 1, \ldots, 2p \tag{5.43} \]

where $p$ is some integer, the choice of which is discussed in [24]. Now let us consider the cepstrum of the tricoherence.

(5.44) 


By{Z\,  Z>,  C;;)  - 


-S'r /( -I  •  ~2«  -a) 


.-1  ,-l 
•-’v'-l  >  ~2  ’-.I 


It  has  been  shown  that  the  trispectrum  of  the  received  data  satisfies: 


5^,.^.=:,)  =  y,r'(;rV'u-z)/-*ur!  )/•(• 


-i _  - 1 .  - 1 


r>7 


Therefore, 


Ry(Zl,Z-2,  <}) 


After  some  algebra,  we  obtain 


/-  ‘(-,1) /'  (  'j ) /•-(  ' ) /■’(  A  1  ■  1  ^ 

/•'( “U  •(',)/•  Ic,1 ) /  *(-', 


i?y(rn,  n,l)  —  ^ 


/h[.-1i|  m  —  0.  «  —  0,/  —  0 

-  7?<T'0]  7 n  >  O.n  =  0J  =  0 

-T[.-r<-m)  -  /?•<-")]  m  <  0, 7;  =  0,  /  —  0 
_i[.4*(n)  _  5*(")]  m  =  0.  n  >  0. 1  —  0 

_  If.4-(-«)  _  ,n  =  0.  r?  >  0,  /  =  0 

n L  J 


i[,4-(^)  _  /}(">)] 


m  -  n  —  l  >  0 


J-(.4(-^)  _  £-(-"»)]  m  =  Tl  =  l  <0 


.|f.4*(0  _£(')] 


m  =  0,n  =  0,/>0 


m  =  0,n  =  (),/  <  0 


otherwise 


Taking the logarithm of both sides of (5.44), we obtain

\[ \ln R_y(z_1, z_2, z_3) = \frac{1}{2} \left[ \ln S_y(z_1, z_2, z_3) - \ln S_y^{*}(z_1^{-1}, z_2^{-1}, z_3^{-1}) \right] \tag{5.48} \]

Differentiating with respect to $z_1$ and performing the inverse Z-transform, we obtain

\[ 2 L_y(m, n, l) * L_y^{*}(-m, -n, -l) * \left[ -m R_y(m, n, l) \right] = L_y^{*}(-m, -n, -l) * \left[ -m L_y(m, n, l) \right] + L_y(m, n, l) * \left[ m L_y^{*}(-m, -n, -l) \right] \tag{5.49} \]

By defining the following functions:

\[ \begin{aligned} \theta_1(m, n, l) &= L_y^{*}(-m, -n, -l) * L_y(m, n, l) \\ \theta_2(m, n, l) &= L_y^{*}(-m, -n, -l) * \left[ m L_y(m, n, l) \right] \end{aligned} \tag{5.50} \]

and combining (5.49) and (5.50), we obtain:

\[ 2 \theta_1(m, n, l) * \left[ m R_y(m, n, l) \right] = \theta_2(m, n, l) + \theta_2^{*}(-m, -n, -l) \tag{5.51} \]

Defining $D^{(k)} = A^{(k)} - B^{*(k)}$ and combining with (5.47), we obtain:

\[ \sum_{k=1}^{p} D^{(k)} \left[ \theta_1(m-k, n-k, l-k) - \theta_1(m-k, n, l) \right] + \sum_{k=1}^{p} D^{*(k)} \left[ \theta_1(m+k, n+k, l+k) - \theta_1(m+k, n, l) \right] = \theta_2(m, n, l) + \theta_2^{*}(-m, -n, -l) \tag{5.52} \]

A rule of thumb is to define $w = p$, $z \leq w/2$, $h \leq z$ and then take $m = -w, \ldots, -1, 1, \ldots, w$, $n = -z, \ldots, z$, $l = -h, \ldots, h$ to form a linear overdetermined system of equations.

5.3  The  POTEA  Algorithm 

In  this  section  the  POTEA  algorithm  is  given  in  detail. 

Let 

$N_1, N_2$: Lengths of minimum and maximum phase components of the equalizer.
$p$: Length of minimum and maximum phase cepstral parameters.


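At iteration $i = 1, 2, \ldots$:

Step 1 Estimate adaptively the $\hat{L}_y^{(i)}(m, n, l)$ for $-M_1 \leq m, n, l \leq M_1$, and $\hat{r}_y^{(i)}(m)$ for $-M_2 \leq m \leq M_2$, from a finite length window of $\{y(i)\}$, and then generate the following functions:

\[ \theta_1(m, n, l) = L_y^{*}(-m, -n, -l) * L_y(m, n, l), \qquad \theta_2(m, n, l) = L_y^{*}(-m, -n, -l) * \left[ m L_y(m, n, l) \right] \]

Step 2 Choose $p$ arbitrarily such that $A^{(p+1)} \approx 0$, $B^{(p+1)} \approx 0$ and define $w = p$.

Step 3 Form the equations

\[ \sum_{k=1}^{p} S^{(k)} \left[ -r_y(m - k) \right] + \sum_{k=1}^{p} S^{*(k)} \left[ r_y(m + k) \right] = m \, r_y(m), \qquad m = 1, \ldots, 2p \]

where $S^{(k)} = A^{(k)} + B^{*(k)}$, $k = 1, \ldots, p$,

\[ \sum_{k=1}^{p} D^{(k)} \left[ \theta_1(m-k, n-k, l-k) - \theta_1(m-k, n, l) \right] + \sum_{k=1}^{p} D^{*(k)} \left[ \theta_1(m+k, n+k, l+k) - \theta_1(m+k, n, l) \right] = \theta_2(m, n, l) + \theta_2^{*}(-m, -n, -l) \]

and the following systems of equations

\[ P a = p, \qquad Q b = q \]

where the matrices $P$, $a$, $p$, $Q$, $b$ and $q$ are defined above.
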


Step  4  Solve  adaptively  the  above  systems  employing  LMS-type  adaptation  as  follows: 


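\[ \begin{aligned} a(i+1) &= a(i) + \mu(i) P^H(i) e(i) \\ b(i+1) &= b(i) + \mu'(i) Q^H(i) e'(i) \end{aligned} \]

where

\[ e(i) = p(i) - P(i) a(i), \qquad e'(i) = q(i) - Q(i) b(i) \]

\[ 0 < \mu(i) < \frac{2}{\mathrm{tr}(P^H P)}, \qquad 0 < \mu'(i) < \frac{2}{\mathrm{tr}(Q^H Q)} \]

The algorithm at instant $i$ minimizes the mean square error:

\[ J(i) = E\{e^H(i) e(i)\} \tag{5.57} \]

\[ J'(i) = E\{e'^H(i) e'(i)\} \tag{5.58} \]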

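Step 5 Calculate $\hat{A}^{(k)}(i)$ and $\hat{B}^{(k)}(i)$ as follows:

\[ \hat{A}^{(k)}(i) = \frac{1}{2} \left[ \hat{S}^{(k)}(i) + \hat{D}^{(k)}(i) \right], \qquad \hat{B}^{(k)}(i) = \frac{1}{2} \left[ \hat{S}^{(k)}(i) - \hat{D}^{(k)}(i) \right]^{*} \tag{5.59} \]

Step 6 Calculate

\[ \hat{i}_{eq}(i, k) = -\frac{1}{k} \sum_{n=2}^{k+1} \left[ -\hat{A}^{(n-1)}(i) \right] \hat{i}_{eq}(i, k-n+1), \qquad k = 1, \ldots, N_1 \tag{5.60} \]

\[ \hat{o}_{eq}(i, k) = \frac{1}{k} \sum_{n=k+1}^{0} \left[ -\hat{B}^{(1-n)}(i) \right] \hat{o}_{eq}(i, k-n+1), \qquad k = -1, \ldots, -N_2 \tag{5.61} \]

with initialization: $\hat{i}_{eq}(i, 0) = \hat{o}_{eq}(i, 0) = 1$. The normalized ($A = 1$) estimate $u_{norm}(i, k)$ at iteration (i) is given by:

\[ u_{norm}(i, k) = \hat{i}_{eq}(i, k) * \hat{o}_{eq}(i, k) \tag{5.62} \]

Step 7 Estimate the gain factor $\hat{A}(i)$.

Step 8 The reconstructed transmitted sequence at iteration (i) is:

\[ \hat{x}(i) = \frac{1}{\hat{A}(i)} \sum_{k=-N_2}^{N_1} u_{norm}(i, k) \, y(i - k) \tag{5.63} \]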

Computational  Complexity 

In  this  section  the  computational  complexity  of  POTEA  is  presented  and  compared  with  the 
computational  complexity  of  TEA. 

PAM 

POTEA:  3(2M+1}3  +  3(2 M  +  1)  +  2 p{Np  +  p  +  1)  +  "2+s*+3  +  (4M)3log2  4M 
TEA:  3(2A^+1)3  +  3(2M  +  1) + (p  +  q)(2Np  +  1)  +  ‘^**±3 

QAM 

POTEA:  4[3i2^±li!  +  2(2 M  +  1)  +  2p(2JVp  +  4p  +  2)  +  +  (4M)3log2  4 M) 

TEA:  4(^1^  +  (p  +  q)(2Np  +  1)  + 
5.4 Cross-Tricepstrum Equalization Algorithm (CTEA) [8]
5.4.1  Problem  Formulations 


Assume we have $n$ measurements at each time index $k$, $y_i(k)$, $i = 1, 2, \ldots, n$, where

\[ y_i(k) = f_i(k) * x(k) + n_i(k) \tag{5.64} \]

(shown in Figure 5.1 for $n = 4$) and

1. $f_i(k)$ is the impulse response of a discrete time linear time invariant system,

2. $x(k)$ is a non-Gaussian, nth order white process with cumulant $\gamma_x \neq 0$,

3. $n_i(k)$ is zero-mean additive noise, with $n_i(k)$ independent of $n_j(k)$ for $i \neq j$ and independent of $x(k)$. No assumptions are made about the pdf or whiteness (in time) of $n_i(k)$.

We also assume that each impulse response $f_i(k)$ is stable with no zeros on the unit circle and that its Z transform $F_i(z)$ can be written as [8]

\[ F_i(z) = A_i z^{-r_i} I_i(z^{-1}) O_i(z) \tag{5.65} \]

where the $A_i$ are gain constants, the $r_i$ are integer linear phase factors,

\[ I_i(z^{-1}) = \frac{\prod_{j} (1 - a_{ij} z^{-1})}{\prod_{j} (1 - c_{ij} z^{-1})} \]

is the minimum phase component and

\[ O_i(z) = \prod_{j=1}^{L_{i2}} (1 - b_{ij} z) \]

is the maximum phase component, with zeros $a_{ij}$ and poles $c_{ij}$ inside and zeros $b_{ij}$ outside the unit circle (i.e. $|a_{ij}| < 1$, $|b_{ij}| < 1$, and $|c_{ij}| < 1$).

5.4.2  Relation  of  Cross-Tricepstrum  of  the  Linear  Filter  Output 

With the above assumptions, the nth-order cross-spectrum of the $y_i(k)$ can be written as

\[ S_{y,1,2,\ldots,n}(z_1, z_2, \ldots, z_{n-1}) = \gamma_x F_1(z_1) F_2(z_2) \cdots F_{n-1}(z_{n-1}) F_n\!\left( \prod_{i=1}^{n-1} z_i^{-1} \right) \tag{5.66} \]

Taking the logarithm and performing the inverse Z-transform on both sides, we obtain after some algebra the following result:

\[ c_{y,1,2,\ldots,n}(m_1, m_2, \ldots, m_{n-1}) = \begin{cases} \ln \gamma_x, & m_1 = m_2 = \ldots = m_{n-1} = 0, \\ -(1/m_i) A_i(m_i), & m_i > 0,\ m_j = 0,\ j \neq i,\ i = 1, 2, \ldots, n-1, \\ (1/m_i) B_i(-m_i), & m_i < 0,\ m_j = 0,\ j \neq i,\ i = 1, 2, \ldots, n-1, \\ (1/m_n) A_n(-m_n), & m_1 = m_2 = \ldots = m_{n-1} = m_n < 0, \\ -(1/m_n) B_n(m_n), & m_1 = m_2 = \ldots = m_{n-1} = m_n > 0, \\ 0, & \text{otherwise} \end{cases} \tag{5.67} \]

with

\[ A_i(k) = \sum_{j} (a_{ij})^k - \sum_{j} (c_{ij})^k, \qquad B_i(k) = \sum_{j=1}^{L_{i2}} (b_{ij})^k. \tag{5.68} \]
j=i 
This result means that the nth-order cross-cepstrum is non-zero on $n$ lines only in its domain and that on each of these lines we find the complex cepstrum of a zero-linear-phase, scaled version of one of the $n$ impulse responses.

Now, to develop a least squares solution for the $A_i$ and $B_i$, we take first partial derivatives of the logarithm of (5.66), independently with respect to each of its variables, followed by inverse Z transforms. Letting $S_{y,1,2,\ldots,n}(m_1, m_2, \ldots, m_{n-1})$ denote the nth-order cross cumulants of the $y_i$, we get the following $n - 1$ equations relating the cross cumulants to the cepstral coefficients:

\[ S_{y,1,2,\ldots,n}(m_1, m_2, \ldots, m_{n-1}) * \left( -m_i \cdot c_{y,1,2,\ldots,n}(m_1, m_2, \ldots, m_{n-1}) \right) = -m_i \, S_{y,1,2,\ldots,n}(m_1, m_2, \ldots, m_{n-1}) \]

for $i = 1, 2, \ldots, n-1$. Each equation involves an $(n-1)$-dimensional convolution. However, substituting (5.67) reduces each equation to a single summation:

\[ \sum_{k=1}^{\infty} \Big[ A_i(k) S_{y,1,2,\ldots,n}(t_1, t_2, \ldots, t_{n-1}) - B_i(k) S_{y,1,2,\ldots,n}(u_1, u_2, \ldots, u_{n-1}) - A_n(k) S_{y,1,2,\ldots,n}(m_1 + k, m_2 + k, \ldots, m_{n-1} + k) + B_n(k) S_{y,1,2,\ldots,n}(m_1 - k, m_2 - k, \ldots, m_{n-1} - k) \Big] = -m_i \, S_{y,1,2,\ldots,n}(m_1, m_2, \ldots, m_{n-1}) \tag{5.69} \]

where

\[ t_i = m_i - k, \qquad u_i = m_i + k, \qquad t_j = u_j = m_j, \quad j \neq i. \]
From equation (5.68) the terms of the sums in (5.69) decay, so we can truncate them to $p_i$ and $q_i$ for the terms involving $A_i$ and $B_i$ respectively (see [8]) and rewrite (5.69) as a finite dimensional vector dot product equation. Writing $M > p_i + q_i + p_n + q_n$ equations at $M$ points in the $(n-1)$-dimensional domain of $S_{y,1,2,\ldots,n}$ we can form the overdetermined system

Rin  ■  £in  =  7in  (5.70) 


5.4.3  Cross-TEA  (CTEA)  Algorithm 

In  this  section  we  describe  the  CTEA  algorithm  for  blind  equalization  of  QAM  signals  with  four 
receivers.  The  algorithm  has  two  stages  at  each  iteration: 

1.  Channel  identification  and  deconvolution 

2.  Combining  by  use  of  a  decision  rule 
Channel  Identification  and  Deconvolution 

Step  1.  Estimate  the  cross-cumulants  and  kurtoses  of  the  received  data  recursively. 

Step 2. Form the systems of equations (5.70) and solve each system in turn to get the cepstral coefficients for each channel. (The cepstral coefficients for channel four can be estimated from the solution of one of the three systems or an average of all three.)

Step 3. From the results of the previous step, estimate the forward and inverse channel impulse responses up to a desired length.

Step 4. From the estimated forward impulse response and the kurtoses, estimate the gains for each channel.

Step 5. With the estimated inverse response, $\hat{f}_{i,inv}(k)$, and the estimated gain $\hat{A}_i$ for each channel, deconvolve to estimate the input symbol as

\[ \hat{x}_i(j) = \frac{1}{\hat{A}_i} \, y_i(j) * \hat{f}_{i,inv}(j) \]

Combining  Decision  Rules 

As illustrated in Figure 5.1, from the four estimates $\hat{x}_i(j)$ we need to form a single quantized decision $\hat{x}(j)$. We describe here an optimal combining rule in the case of a perfect equalizer, as well as three sub-optimal schemes: arithmetic mean, majority rule, and median (which for $n = 4$ channels is equivalent to the $\alpha$-trimmed mean with $\alpha = 1$).

Optimal Decision for the Perfect Equalizer [8]

We consider the following assumptions:

1. $x(k)$ is complex and uniformly distributed,

2. $u_i(k)$ is the perfect equalizer for $f_i(k)$, i.e. $f_i(k) * u_i(k) = \delta(k)$, and

3. $n_i(k)$ are zero-mean, complex Gaussian variables with known variance $\sigma_i^2$ and are independent across channels.

Since we will do symbol by symbol detection, we will drop the time index $k$ for simplicity. With these assumptions,

\[ \hat{x}_i = x + n_i * u_i = x + \tilde{n}_i. \]

Therefore, the conditional probability density of $\hat{x}_i$ given $x$, $p(\hat{x}_i | x)$, is complex Gaussian with mean $x$ and variance

\[ \tilde{\sigma}_i^2 = \sigma_i^2 \sum_k |u_i(k)|^2. \]
Since the noise in each channel is independent, the maximum likelihood estimate $\hat{x}$ of $x$ given the four observations $\hat{x}_i$ (assuming $x$ to be from a continuous distribution) is

\[ \hat{x} = \hat{x}_R + j \hat{x}_I = \frac{\sum_i \hat{x}_{i,R} / \tilde{\sigma}_i^2}{\sum_i 1 / \tilde{\sigma}_i^2} + j \, \frac{\sum_i \hat{x}_{i,I} / \tilde{\sigma}_i^2}{\sum_i 1 / \tilde{\sigma}_i^2} \]

where the subscripts $R$ and $I$ denote real and imaginary parts respectively. Note that if the noise has the same variance in all channels then this result reduces to the arithmetic mean. If, on the other hand, we assume that $x$ belongs to a known discrete set $V$ then we need to find $x \in V$ which satisfies

\[ \min_{x \in V} \sum_i \frac{|\hat{x}_i - x|^2}{\tilde{\sigma}_i^2} \]

or equivalently

\[ \min_{x \in V} \sum_i \frac{1}{\tilde{\sigma}_i^2} \left( |x|^2 - 2 (x_R \hat{x}_{i,R} + x_I \hat{x}_{i,I}) \right). \]

Of course the assumptions of perfect equalization and known noise variance are not realistic in practice, so we describe below three sub-optimal combining rules which we tested in our simulations.
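The optimal rule can be sketched as inverse-variance weighting, which is the maximum likelihood combination for independent Gaussian observations with a common mean. The function names `ml_combine` and `ml_decide`, and the argument `var` holding the per-channel effective noise variances, are assumptions of this example.

```python
import numpy as np

def ml_combine(x_hat, var):
    """Continuous-valued ML estimate: inverse-variance weighted mean."""
    w = 1.0 / np.asarray(var, dtype=float)
    return np.sum(w * np.asarray(x_hat)) / np.sum(w)

def ml_decide(x_hat, var, constellation):
    """Discrete ML decision: constellation point with the smallest weighted distance."""
    x_hat = np.asarray(x_hat)
    w = 1.0 / np.asarray(var, dtype=float)
    c = np.asarray(constellation)
    cost = np.sum(w[None, :] * np.abs(x_hat[None, :] - c[:, None]) ** 2, axis=1)
    return complex(c[np.argmin(cost)])

# Usage: equal variances reduce to the arithmetic mean of the four estimates.
estimates = [1 + 1j, 3 + 1j, 1 - 1j, 3 - 1j]
print(ml_combine(estimates, [2.0, 2.0, 2.0, 2.0]))
qam4 = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
print(ml_decide([0.9 + 1.2j, 1.1 + 0.8j, -0.2 + 0.9j, 1.3 + 1.1j], [1, 1, 1, 1], qam4))
```

Channels with larger effective noise variance receive proportionally less weight, so a single poor channel does not dominate the combined estimate.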

Arithmetic  Mean 


Step 1. Form a soft decision statistic

\[ \bar{x}(j) = \frac{1}{4} \sum_{i=1}^{4} \hat{x}_i(j) \]

(If information is available about the relative quality of the channels then a weighted mean could be used.)

Step 2. Put $\bar{x}(j)$ through a decision device to get $\hat{x}(j)$.
Majority  Rule 

Step 1. Put each estimate $\hat{x}_i(j)$ through a decision device to form four decision statistics $\tilde{x}_i(j)$.

Step 2. If there is a plurality among the $\tilde{x}_i(j)$ in one region of the decision space then that is the decision. If there is a tie (all four different or two votes for each of two decisions) use a tie-breaking procedure. One method would be to pick the decision region that has the smallest average squared decision error. For example, if $\tilde{x}_1(j) = \tilde{x}_2(j) \neq \tilde{x}_3(j) = \tilde{x}_4(j)$:

\[ \text{Let } d_1 = \sum_{i=1}^{2} |\hat{x}_i(j) - \tilde{x}_i(j)|^2, \qquad \text{Let } d_2 = \sum_{i=3}^{4} |\hat{x}_i(j) - \tilde{x}_i(j)|^2 \]

Then choose $\tilde{x}_1(j)$ if $d_1 < d_2$, and $\tilde{x}_3(j)$ if $d_2 < d_1$.

Median 

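Step 1. Order the real and imaginary parts of the $\hat{x}_i(j)$ separately.

Step 2. Form the soft decision statistic

\[ \mathrm{REAL}\{\bar{x}(j)\} = \mathrm{median}\{\mathrm{REAL}\{\hat{x}_i(j)\}\}, \qquad \mathrm{IMAG}\{\bar{x}(j)\} = \mathrm{median}\{\mathrm{IMAG}\{\hat{x}_i(j)\}\} \]

Step 3. Put $\bar{x}(j)$ through the decision device to get $\hat{x}(j)$.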
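The three sub-optimal rules can be compared on a single symbol. This is a minimal sketch: the nearest-point `decide` function stands in for the decision device, the tie-break follows the smallest average squared decision error suggestion, and all function names are hypothetical.

```python
import numpy as np
from collections import Counter

def decide(x, constellation):
    """Decision device: nearest constellation point."""
    c = np.asarray(constellation)
    return complex(c[np.argmin(np.abs(c - x))])

def combine_mean(x_hat, constellation):
    return decide(np.mean(x_hat), constellation)

def combine_median(x_hat, constellation):
    x_hat = np.asarray(x_hat)
    soft = np.median(x_hat.real) + 1j * np.median(x_hat.imag)
    return decide(soft, constellation)

def combine_majority(x_hat, constellation):
    hard = [decide(x, constellation) for x in x_hat]
    counts = Counter(hard)
    top = max(counts.values())
    tied = [s for s, c in counts.items() if c == top]
    if len(tied) == 1:
        return tied[0]

    def avg_sq_error(s):
        # average squared distance of the estimates that voted for s
        errs = [abs(x - s) ** 2 for x, h in zip(x_hat, hard) if h == s]
        return sum(errs) / len(errs)

    return min(tied, key=avg_sq_error)

# Usage: one channel delivers an outlier; the mean is pulled away by it.
qam4 = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
estimates = [0.9 + 0.8j, 1.2 + 1.1j, -0.1 + 0.7j, 5.0 - 3.0j]
print(combine_mean(estimates, qam4))
print(combine_median(estimates, qam4))
print(combine_majority(estimates, qam4))
```

For four channels the median of each part is the average of the two middle values, which is why it coincides with a trimmed mean that discards the largest and smallest value.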

5.5  Computer  Simulations 

Computer simulations have been employed to compare the performance of the blind equalization algorithms. The performance metrics used are those in Section 2. The following issues are addressed.

5.5.1  TEA  vs.  Bussgang-type  Algorithms 

Figs. 5.2-5.4 show the performance of the TEA algorithm, compared with that of Bussgang-type algorithms, such as the Godard, Benveniste-Goursat, and Stop-and-Go algorithms. We see that the TEA algorithm opens the eye much faster than the Bussgang-type algorithms. This performance improvement is achieved at the expense of larger computational complexity.

5.5.2  POTEA  vs.  TEA 

Figs. 5.5-5.6 show the performance of the POTEA algorithm, compared with that of TEA. We see that the POTEA algorithm converges faster than the TEA algorithm. The performance improvement is achieved at the expense of a further increase in computational complexity.


5.5.3  CTEA  vs.  TEA 

Figs. 5.7-5.8 show the performance of the CTEA algorithm compared with that of the TEA algorithm. We see that the CTEA algorithm converges faster than the TEA algorithm for some channels. The performance improvement is achieved at the expense of a further increase in computational complexity.

6 ALGORITHM WITH NONLINEARITY INSIDE THE EQUALIZATION FILTER

Still another class of blind equalization algorithms consists of those which use Volterra filters [9], [10] or neural networks [20], [26], [27]. This class of algorithms performs nonlinear operations inside the equalization filter. It is therefore also able to correctly extract the phase information of the unknown channel from its output only. In this section, we will concentrate on those algorithms based on neural networks.

6.1  Review  of  Equalization  Techniques  Based  on  Neural  Networks 

Equalization is a technique which is used to combat the intersymbol interference caused by non-ideal channels. Usually, equalizers are implemented using linear transversal filters [17], [18], [30], [31]. However, when the unknown channel has deep spectral nulls or some severe nonlinear distortions, such as phase jitter and frequency offset, linear equalizers are not powerful enough to compensate for all of these. That is why nonlinear filters, such as those implemented by Volterra filters or neural networks, play an important role.

Neural Networks (NNWs) are mathematical models of theorized mind and brain activities.
The fundamental idea of NNWs is to organize many simple identical processing elements into
layers to perform more sophisticated tasks. The properties of NNWs include massive parallelism,
high computation rates, great capability for nonlinear problems, continuous adaptation,
inherent fault tolerance, and ease of VLSI implementation. All these properties make NNWs
attractive for various applications. Several neural-network-based algorithms have been proposed
for equalization problems.

1  Multi-Layer  Perceptron 

The multi-layer perceptron (MLP) [39], [40] is one of the most widely used implementations
of NNWs. It comprises a number of nodes arranged in layers, as shown in Figure
6.1. A node receives a number of inputs $x_1, x_2, \ldots, x_n$, which are multiplied by a set
of weights $w_1, w_2, \ldots, w_n$, and the resulting values are summed. A constant $v$, known
as the node threshold, is added to this weighted sum of inputs, and the output of the node
is obtained by evaluating a nonlinear (sigmoid) function $f(\cdot)$, called the activation
function.
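The node computation just described can be sketched as follows. This is a minimal illustration, not the authors' implementation: the bipolar sigmoid is assumed to be $(1 - e^{-as})/(1 + e^{-as})$, and the names `sigmoid`, `node_output` and the slope parameter `a` are hypothetical.

```python
import math


def sigmoid(s, a=1.0):
    """Bipolar sigmoid (1 - exp(-a*s)) / (1 + exp(-a*s)), equal to tanh(a*s/2)."""
    return math.tanh(0.5 * a * s)


def node_output(inputs, weights, threshold, a=1.0):
    """Weighted sum of the inputs plus the node threshold, passed through the sigmoid."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    s = sum(w * x for w, x in zip(weights, inputs)) + threshold
    return sigmoid(s, a)


# Example: three inputs, three weights, threshold v = 0.1 (illustrative values only).
y = node_output([1.0, -1.0, 0.5], [0.2, 0.4, -0.6], 0.1)
```

The output always lies strictly between -1 and 1, which is the property that Section 6.2 later relaxes for the output layer.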

The architecture of a perceptron can be described by a sequence of integers $n_0, n_1, \ldots, n_k$,
where $n_0$ is the dimension of the input to the network and the number of nodes in each
successive layer, ordered from input to output, is $n_1, n_2, \ldots, n_k$. In this notation, the MLP
produces a nonlinear mapping $g: R^{n_0} \to R^{n_k}$.

The connection coefficients of the MLP are updated iteratively by using the back-propagation
(BP) algorithm with the following formula:

$$(w_{i+1}, v_{i+1}) = (w_i, v_i) + \Delta_i \tag{6.1}$$

and

$$\Delta_i = -\beta\,\frac{\partial e^2}{\partial (w_i, v_i)} + \alpha \cdot \Delta_{i-1} \tag{6.2}$$


2  Self-Organizing  Feature  Maps 

The self-organizing feature map (SOM), introduced by Kohonen [26],
[27], consists of two layers of nodes, referred to as the input layer and the output layer, which are
fully connected with different connection weights. The inputs to the SOM can be any
continuous values, whereas each output-layer node represents a pattern class that the
input vector may belong to. That means the outputs of SOMs are discrete values, and
therefore the SOM is sometimes also referred to as a learning vector quantizer.

The SOM works iteratively as follows. First, find the set of connection coefficients $W_g$
which is the closest to the input vector $A_k$,

$$\|A_k - W_g\| = \min_{j} \|A_k - W_j\|. \tag{6.3}$$

Second, perform the following quantization of the output-layer nodes:

$$b_j = \begin{cases} 1, & \text{if } \|A_k - W_j\| = \min_{l} \|A_k - W_l\| \\ 0, & \text{otherwise,} \end{cases} \tag{6.4}$$

and then move $W_g$ closer to $A_k$ using the equation

$$\Delta W_j = \begin{cases} \alpha(k) \cdot [A_k - W_j], & j = g \\ \beta(k) \cdot [A_k - W_j], & j \in N_r,\ j \neq g \\ 0, & \text{otherwise,} \end{cases} \tag{6.5}$$

where $N_r$ is the topological neighborhood of the winning node $b_g$, which consists of $b_g$ itself
and its direct neighbors up to depth $r = 1, 2, \ldots$, and $\alpha(k)$ and $\beta(k)$ are the learning rates
at time $k$.
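One iteration of the SOM procedure above (find the winner, quantize the outputs, move the winner and its neighbors toward the input) might be sketched as below. The function name `som_step` and the dictionary representation of the neighborhood $N_r$ are assumptions made for illustration; the source does not specify a data structure.

```python
import math


def som_step(weights, a_k, alpha, beta, neighbours):
    """One SOM iteration.

    weights    : list of weight vectors W_j (lists of floats)
    a_k        : input vector A_k
    alpha      : learning rate for the winning node
    beta       : learning rate for the winner's neighbours
    neighbours : dict mapping a node index to the indices in its neighbourhood
                 (hypothetical representation of N_r)
    Returns (g, b, new_weights).
    """
    dist = [math.dist(a_k, w) for w in weights]
    g = min(range(len(weights)), key=dist.__getitem__)      # winning node
    b = [1 if j == g else 0 for j in range(len(weights))]    # quantized outputs
    new_weights = []
    for j, w in enumerate(weights):
        if j == g:
            rate = alpha
        elif j in neighbours.get(g, ()):
            rate = beta
        else:
            rate = 0.0
        new_weights.append([wi + rate * (ai - wi) for wi, ai in zip(w, a_k)])
    return g, b, new_weights


g, b, W = som_step([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]], [0.9, 1.2],
                   alpha=0.5, beta=0.1, neighbours={1: [0]})
```

In this small example node 1 wins, node 0 is treated as its neighbor and moves slightly, and node 2 is left unchanged.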


6.2  The  MLPs  Equalization  Algorithm  for  PAM  and  QAM  Signals 

The applications of the MLP in equalization problems have so far been limited to binary {0, 1} or
bipolar {-1, 1} valued data and real-valued channel models [11], [20], [49]. In this section, we
introduce for the first time a new implementation structure of the MLP which works well with
L-PAM ($L > 2$) and N-QAM ($N > 4$) signals.

Looking into an MLP structure, we find that it is the sigmoid function of the output
layer nodes that confines the network outputs to the range [-1, 1]. In our equalization problem,
the signals are equally spaced and symmetric with respect to either the origin of the
coordinates or the x and y axes. Thus we can simply scale up the node function of the
output layer by a constant factor $C$ which is large enough to cover our maximum
signal range, e.g., [-15, 15] for 16-PAM or 256-QAM signals. So, for the output layer, we have
[30], [40]

$$f_o(x) = C\,\frac{1 - e^{-ax}}{1 + e^{-ax}}, \qquad (C > 1) \tag{6.6}$$

as the activation function. For the hidden layers, we still use the sigmoid function

$$f_h(x) = \frac{1 - e^{-ax}}{1 + e^{-ax}}. \tag{6.7}$$

The idea of adding another constant $a$ comes from the thought that a smaller $a$, equivalently
a lower slope in Figure 6.2, would avoid strong oscillation and, in turn, decrease the chance of
divergence in the course of weight adjustment.


For complex channel models and QAM signals, we use complex connection coefficients to
get the weighted sum, to which a complex threshold is added. Then the sigmoid functions of
the real and the imaginary parts of this thresholded weighted sum are evaluated separately.
Again, for the output layer nodes, the outputs are multiplied by a constant $C$. Using the steepest
descent formula (Eqs. 6.1, 6.2), we get the adaptation algorithm of our new MLP equalizer, which
is described in Table 6.1 [30], [40].
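The scaled activation and the split treatment of real and imaginary parts can be sketched as follows. This assumes the sigmoid has the form $(1 - e^{-ax})/(1 + e^{-ax})$ for hidden nodes and $C$ times that for output nodes; the function names are hypothetical and the snippet is only an illustration of the idea, not the algorithm of Table 6.1.

```python
import math


def f_hidden(x, a=1.0):
    """Hidden-layer sigmoid (1 - exp(-a*x)) / (1 + exp(-a*x)) = tanh(a*x/2)."""
    return math.tanh(0.5 * a * x)


def f_output(x, C, a=1.0):
    """Output-layer activation: the same sigmoid scaled to the range (-C, C)."""
    return C * f_hidden(x, a)


def complex_node(inputs, weights, threshold, a=1.0, C=1.0):
    """Complex node: complex weighted sum plus complex threshold, then the
    sigmoid applied separately to the real and imaginary parts.
    Use C = 1 for a hidden node and C > 1 for an output node."""
    s = sum(w * x for w, x in zip(weights, inputs)) + threshold
    return C * complex(f_hidden(s.real, a), f_hidden(s.imag, a))


# With C = 15 the output can reach the 16-PAM / 256-QAM levels quoted in the text.
out = complex_node([1 + 1j], [1 - 1j], 0j, a=1.0, C=15.0)
```

A smaller `a` flattens the slope around the origin, which is the mechanism the text invokes to reduce the risk of divergence.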

Simulations are conducted to examine the performance of MLP equalizers. The equalizer is
implemented by the new MLP structure with only one output node. The input data to the
system, $x_i$, are assumed to be independent of each other. The delayed input sequence $x_{i-d}$, where
$d$ is channel dependent, is used as the training sequence. The performance of MLP equalizers is
evaluated by calculating the mean square error (MSE) $E[(\tilde{x} - x)^2]$ and the average symbol error
rate (SER) of the quantizer output. The eye pattern of the equalizer outputs around a certain number
of iterations is shown in Figure 6.3.

Figure 6.4 illustrates the performance comparison between the MLP and an LMS-based linear
transversal equalizer with the same number of inputs. The structure (the number of nodes in the hidden
layer) of the MLP has been fine-tuned through experiment. The step size $\mu$ of the LMS-based
equalizer is also optimized (the largest value that does not cause divergence). From Fig. 6.4, it
appears that the new MLP structure works not much better, as a channel equalizer, than the
simple linear adaptive equalizer. As a matter of fact, both methods end up giving similar results.

7  CONCLUSIONS


The purpose of this paper is to provide a tutorial review of existing blind equalization algorithms
for digital communications. Three families of techniques have been described, namely, the
Bussgang techniques, the polyspectra-based techniques, and methods based on nonlinear equalization
filters or neural networks. The complexity of the Bussgang techniques is approximately $2N$
multiplications per iteration, where $N$ is the order of the linear equalization filter. On the other
hand, the polyspectra-based techniques require a number of multiplications per iteration that grows as $N^3$.
However, as has been demonstrated in the paper, the polyspectra-based techniques achieve a
significantly faster convergence rate than the Bussgang techniques. Finally, it is pointed out in
the paper that blind equalizers based on nonlinear filters or neural networks are better suited for
equalization of channels with nonlinear distortions.

Table 4.1 Nonlinear Functions of Bussgang Iterative Techniques.


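$u(i) = [u_1(i), \ldots, u_N(i)]^T$            equalizer taps
$y(i) = [y(i), \ldots, y(i - N + 1)]^T$        input to the equalizer (block of data)

At iteration $i$, $i = 1, 2, \ldots$:

    $\tilde{x}(i) = u^H(i)\, y(i)$
    $e(i) = g[\tilde{x}(i)] - \tilde{x}(i)$
    $u(i+1) = u(i) + \mu\, y(i)\, e^*(i)$

Algorithm               Nonlinear function $g[\tilde{x}(i)]$

LMS training mode       $x(i)$ (linear)

Decision Directed Mode  $\hat{x}(i)$

Sato                    $\gamma\, \mathrm{csgn}[\tilde{x}(i)]$

Benveniste-Goursat      $\tilde{x}(i) + k_1 (\hat{x}(i) - \tilde{x}(i)) + k_2 |\hat{x}(i) - \tilde{x}(i)| \cdot (\gamma\, \mathrm{csgn}[\tilde{x}(i)] - \tilde{x}(i))$

Godard ($p, q = 2$)     $\dfrac{\tilde{x}(i)}{|\tilde{x}(i)|} \cdot \{ |\tilde{x}(i)| + R_p |\tilde{x}(i)|^{p-1} - |\tilde{x}(i)|^{2p-1} \}$

Stop-and-Go             $\tilde{x}(i) + \frac{1}{2} A (\hat{x}(i) - \tilde{x}(i)) + \frac{1}{2} B (\hat{x}(i) - \tilde{x}(i))^*$,
                        $(A, B) = (2, 0), (1, 1), (1, -1)$ or $(0, 0)$, depending
                        on the signs of the DD and Sato errors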
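The generic Bussgang iteration tabulated in Table 4.1 can be sketched in a few lines. The sketch assumes the update reads $\tilde{x}(i) = u^H(i) y(i)$, $e(i) = g[\tilde{x}(i)] - \tilde{x}(i)$, $u(i+1) = u(i) + \mu y(i) e^*(i)$, and uses the Godard nonlinearity with $p = 2$ as the example; the function names and numerical values are hypothetical, and the printed table is too damaged to confirm every detail.

```python
def godard_g(x, R, p=2):
    """Godard nonlinearity: (x/|x|) * (|x| + R*|x|**(p-1) - |x|**(2p-1))."""
    r = abs(x)
    if r == 0:
        return 0j
    return (x / r) * (r + R * r ** (p - 1) - r ** (2 * p - 1))


def bussgang_step(u, y, mu, g):
    """One iteration of a Bussgang-type equalizer.

    u  : list of equalizer taps (complex)
    y  : list of the N most recent received samples (complex)
    mu : step size
    g  : memoryless nonlinearity applied to the equalizer output
    Returns (next taps, equalizer output, error).
    """
    x = sum(un.conjugate() * yn for un, yn in zip(u, y))   # x~ = u^H y
    e = g(x) - x
    u_next = [un + mu * yn * e.conjugate() for un, yn in zip(u, y)]
    return u_next, x, e


taps = [1 + 0j, 0j]
samples = [0.5 + 0j, 0.2 + 0j]
taps_next, x_out, err = bussgang_step(taps, samples, 0.1, lambda z: godard_g(z, R=1.0))
```

For $p = 2$ the error reduces to $\tilde{x}(R_2 - |\tilde{x}|^2)$, so the update stops moving once the output modulus matches the dispersion constant.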
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Table  4.2  Comparison  of  Computational  Complexity 


                        Godard    CRIMNO             Adaptive Weight CRIMNO (memory size M)
                                  (memory size M)    Version I         Version II

Real Multiplication     4N+5      4N+8M+5            MN+8M+4N+5        4N+10M+5

Table 6.1 Complex MLP adaptation algorithm.

1). Assign small random complex numbers to all the connections and
    thresholds.

2). Forward propagate inputs through the network:

    $\sum_{l=1}^{n_i} w_{i,j,l}\, a_{i,l} + v_{i,j} = \alpha^R_{i+1,j} + j \cdot \alpha^I_{i+1,j}$,

    $a_{i+1,j} = f(\alpha^R_{i+1,j}) + j \cdot f(\alpha^I_{i+1,j})$,

    where $i = 1, \ldots, M$ ($M$ is the number of layers), $f(\cdot)$ is the sigmoid
    function, and get the output,

    $\tilde{x} = C \cdot a_M$.

3). Present the training signal to find the output error,

    $e_M = e^R [1 - (\tilde{x}^R / C)^2] / C + j\, e^I [1 - (\tilde{x}^I / C)^2] / C$,

    where $e = x - \tilde{x}$.

4). Find the backpropagation error,

    $e_{i,j} = d^R_{i,j} \cdot [1 - (a^R_{i,j})^2] + j \cdot d^I_{i,j} \cdot [1 - (a^I_{i,j})^2]$,

    where

    $d_{i,j} = \sum_{l=1}^{n_{i+1}} w^*_{i,l,j}\, e_{i+1,l}$.

5). Adjust connections and thresholds:

    $w_{i,j,k}(n+1) = w_{i,j,k}(n) + \beta \cdot e_{i+1,j}(n) \cdot a^*_{i,k}(n)$,
    $v_{i,j}(n+1) = v_{i,j}(n) + \beta \cdot e_{i+1,j}(n)$,

    where '*' denotes the conjugate operator. The momentum term can also
    be added.

6). Back to Step 2.

Figure 2.2 (a) The Bussgang Algorithms: Nonlinearity is in the Output of Equalization Filter.


Figure 2.2 (b) The Polyspectra Algorithms: Nonlinearity is in the Input of Equalization Filter.

Figure 2.2 (c) Blind Equalizers with Nonlinearity Inside the Equalization Filter.

Figure 4.2 MS Estimate Under Uniform Distribution

Figure 4.3 MS Estimate Under Laplace Distribution

Figure 4.5 MAP Estimate Under Laplace Distribution

Figure 4.6 Comparison of the adaptive weight CRIMNO algorithm (sz1) with Godard's
algorithm of different step-size (sz3 is the optimum step-size): (a) the real
channel; (b) the synthetic channel. Vertical axes: AMSE (dB).

Figure 4.7 Effect of memory size M on the adaptive weight CRIMNO algorithm
(channel 2): (a) Mean square error; (b) Probability of error; (c) Intersymbol
interference.

Figure 4.8 Eye pattern of adaptive weight CRIMNO algorithms with different memory size
M at iteration 20000. (a) Godard; (b) Adaptive weight CRIMNO (M=2); (c)
Adaptive weight CRIMNO (M=4); (d) Adaptive weight CRIMNO (M=6).

Figure 5.1 Diagram of four parallel equalized systems with additive noise.


MOMENTS, CUMULANTS AND SOME APPLICATIONS TO
STATIONARY RANDOM PROCESSES


BY  DAVID  R.  BRILLINGER* 

University  of  California,  Berkeley 

The  paper  ranges  over  some  basic  ideas  concerning  moments  and 
cumulants,  focusing  on  the  case  of  random  processes.  Uses  of  moments 
and  cumulants  in  developing  large  sample  approximate  distributions,  in 
system  identification  and  in  inferring  causal  connections  of  a  network  of 
point  processes  are  presented. 

1. Introduction. Moments and cumulants find many uses in mainstream
statistics generally and with random processes particularly.
Moments reflect the parameters of distributions and hence, as via the
method of moments, may be used to estimate distributional parameters.
Moments may be employed to develop approximations to the statistical
distributions of quantities, such as sums in central limit theorems and
associated expansions. Moments may be used to study the independence of
variates. Moments unify diverse random processes, such as point
processes and random fields, and diverse domains, such as the line or
space-time.

2. Ordinary case. One can begin by asking: What is a moment? To
provide an answer to this question, consider the case of the 0-1 valued
variates X, Y, Z. For these variates

$$E\{XYZ\} = \mathrm{Prob}\{X = 1,\ Y = 1,\ Z = 1\}.$$

This provides an interpretation for a (third-order) moment in terms of a
quantity having a primitive existence, namely a probability. Higher-order
moments have a similar interpretation. One can proceed to general random
variables by noting that these may be approximated by step (or simple)
functions, see e.g. Feller (1966), page 107.

*Research partially supported by NSF Grant DMS-8900613.
AMS 1980 subject classifications. Primary 62M10, 62M99.
Key words and phrases. Coherence, cumulant, moment, partial coherence, point process,
system identification, time series.

Next one can ask: What is a cumulant? One answer is to say that it
is a combination of moments that vanishes when some subset of the variates
is independent of the others. Suppose for example that X is independent
of (Y, Z). The third order joint cumulant may be defined by

$$\mathrm{cum}\{X, Y, Z\} = E\{XYZ\} - E\{X\}E\{YZ\} - E\{Y\}E\{XZ\} - E\{Z\}E\{XY\} + 2E\{X\}E\{Y\}E\{Z\}. \tag{1}$$

By substitution one quickly sees that this last expression vanishes in the
case that X is independent of (Y, Z).
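The vanishing of the third order joint cumulant under independence can be checked exactly on a small discrete example. The sketch below evaluates the moment formula for the third order joint cumulant on a joint probability table; the probability values and function names are illustrative assumptions, not taken from the paper.

```python
def expect(pmf, f):
    """Exact expectation of f(x, y, z) under a joint pmf {(x, y, z): prob}."""
    return sum(p * f(x, y, z) for (x, y, z), p in pmf.items())


def cum3(pmf):
    """Third order joint cumulant cum{X, Y, Z} from the moment formula."""
    ex = expect(pmf, lambda x, y, z: x)
    ey = expect(pmf, lambda x, y, z: y)
    ez = expect(pmf, lambda x, y, z: z)
    eyz = expect(pmf, lambda x, y, z: y * z)
    exz = expect(pmf, lambda x, y, z: x * z)
    exy = expect(pmf, lambda x, y, z: x * y)
    exyz = expect(pmf, lambda x, y, z: x * y * z)
    return exyz - ex * eyz - ey * exz - ez * exy + 2 * ex * ey * ez


# X is 0-1 valued and independent of the (dependent) 0-1 valued pair (Y, Z).
px = {0: 0.7, 1: 0.3}
pyz = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
independent = {(x, y, z): px[x] * pyz[(y, z)] for x in px for (y, z) in pyz}

# X = Y = Z: no subset is independent of the rest.
dependent = {(0, 0, 0): 0.8, (1, 1, 1): 0.2}

c_indep = cum3(independent)
c_dep = cum3(dependent)
```

In the dependent case $E\{XYZ\}$ is simply $\mathrm{Prob}\{X = 1, Y = 1, Z = 1\} = 0.2$, in line with the interpretation of moments of 0-1 variates given earlier.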


Expression (1) gives one definition of a joint cumulant. An alternate
way to proceed is to state that the cumulant is given by the coefficient of
$i^3 \alpha\beta\gamma$ in the Taylor expansion of

$$\log E\{e^{i(\alpha X + \beta Y + \gamma Z)}\},$$

supposing one exists.

Taking the log here converts factorizations into additivities and one sees
immediately why the joint cumulants vanish in the case of independence.

Streitberg (1990) sets down a sequence of conditions that actually
characterize a cumulant. These are:

1. Symmetry

$$\mathrm{cum}\{X_1, X_2, \ldots\} = \mathrm{cum}\{X_2, X_1, \ldots\}$$

2. Multilinearity

$$\mathrm{cum}\{aX_1, X_2, \ldots\} = a\,\mathrm{cum}\{X_1, X_2, \ldots\}$$

$$\mathrm{cum}\{X_1 + Y_1, X_2, \ldots\} = \mathrm{cum}\{X_1, X_2, \ldots\} + \mathrm{cum}\{Y_1, X_2, \ldots\}$$

3. Moment property, if the moments of X and Y are identical up to order
k

$$\mathrm{cum}\{X\} = \mathrm{cum}\{Y\}$$

4. Normalization, in the expansion in terms of moments

$$\mathrm{cum}\{X_1, \ldots, X_k\} = E\{X_1 \cdots X_k\} + \cdots$$

5. Interaction, if a subset is independent of the remainder

$$\mathrm{cum}\{X_1, \ldots, X_k\} = 0$$

Cumulants provide a measure of Gaussianity. If the variate X is normal,
then

$$\mathrm{cum}_k\{X\} = 0 \tag{2}$$

for $k > 2$. (Here $\mathrm{cum}_k$ denotes the joint cumulant of X with itself k
times.) Putting (2) together with the fact that the normal distribution is
determined by its moments provides a particularly brief proof of the central
limit theorem. Namely, suppose that $X_1, X_2, \ldots$ are independent
and identically distributed with $E\{X\} = 0$ and $\mathrm{var}\{X\} = 1$. Suppose all
moments exist for X. Consider

$$S_n = (X_1 + \cdots + X_n)/\sqrt{n}. \tag{3}$$

Then

$$\mathrm{cum}_k\{S_n\} = n\,\mathrm{cum}_k\{X\} / n^{k/2},$$

which tends to 0 for $k > 2$ as $n$ tends to infinity, and in consequence $S_n$
has a limiting normal distribution.
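Both facts used in this argument can be illustrated numerically. The sketch below converts raw moments to cumulants with the standard moment-cumulant recursion $\kappa_n = m_n - \sum_{j=1}^{n-1} \binom{n-1}{j-1} \kappa_j m_{n-j}$ (a textbook identity, not a formula from this paper) and then applies the scaling $n\,\mathrm{cum}_k\{X\}/n^{k/2}$. The function names are hypothetical.

```python
from math import comb


def cumulants_from_moments(m):
    """Cumulants kappa_1..kappa_K from raw moments m = [E X, E X^2, ..., E X^K]."""
    kappa = []
    for n in range(1, len(m) + 1):
        correction = sum(comb(n - 1, j - 1) * kappa[j - 1] * m[n - j - 1]
                         for j in range(1, n))
        kappa.append(m[n - 1] - correction)
    return kappa


def scaled_sum_cumulant(kappa_k, k, n):
    """k-th cumulant of S_n = (X_1 + ... + X_n) / sqrt(n) for i.i.d. X_i."""
    return n * kappa_k / n ** (k / 2)


# Standard normal: raw moments 0, 1, 0, 3, 0, 15; cumulants beyond order 2 vanish.
normal_kappa = cumulants_from_moments([0, 1, 0, 3, 0, 15])

# Exponential(1): raw moments k!; cumulants (k-1)!. Centering changes only kappa_1,
# so the centered variate has mean 0, variance 1 and third cumulant 2.
expo_kappa = cumulants_from_moments([1, 2, 6, 24])
third_for_n_100 = scaled_sum_cumulant(expo_kappa[2], 3, 100)
```

For the exponential example the third cumulant of $S_n$ falls off like $2/\sqrt{n}$, which is the decay that drives the normal limit.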


An error bound may be given for the degree of approximation of the
distribution of a random variable by a normal, via bounds on the cumulants.
In Rudzkis et al. (1978) the following result is developed. Consider
a variate Y with mean 0 and variance 1. Suppose that

$$|\mathrm{cum}_k\{Y\}| \le \frac{H\,(k!)^{1+\nu}}{\Delta^{k-2}}$$

for some $\nu \ge 0$, $H \ge 1$; then, in the interval $0 \le u < \delta/H$,

$$\sup_u |\mathrm{Prob}\{Y < u\} - \Phi(u)| \le \frac{18H}{\delta},$$

where

$$\delta = \frac{1}{7}\left(\sqrt{2}\,\Delta\right)^{1/(1+2\nu)}.$$

In the case of a sum, such as (3), one can take $\Delta = \sqrt{n}$ for example.


3. Time series case. Consider a stationary time series X(t) with
domain $t = 0, \pm 1, \pm 2, \ldots$. If the k-th moment exists, then from
stationarity the moment function

$$E\{X(t+u_1) \cdots X(t+u_{k-1})\,X(t)\}$$

will not depend on $t$, nor will the associated cumulant function

$$c_k(u_1, \ldots, u_{k-1}) = \mathrm{cum}\{X(t+u_1), \ldots, X(t+u_{k-1}), X(t)\}. \tag{4}$$

The Fourier transforms of these $c_k(\cdot)$ give the higher-order spectra of the
series. These functions may be estimated given stretches of data.

It was indicated, by property 5 above, that a joint cumulant measures
statistical dependence. This suggests formalizing the intuitive notion that
values at a distance in time are not strongly dependent via

$$\sum_{u_1} \cdots \sum_{u_{k-1}} |c_k(u_1, \ldots, u_{k-1})| < \infty \tag{5}$$

for $k = 2, 3, \ldots$. It is now direct to provide a central limit theorem for
sums of values of a stationary time series. One has

$$\mathrm{cum}_k\Big\{\sum_t X(t)/\sqrt{T}\Big\} = \sum_{t_1} \cdots \sum_{t_k} c_k(t_1 - t_k, \ldots, t_{k-1} - t_k)\,/\,T^{k/2} \approx T \sum_{u_1} \cdots \sum_{u_{k-1}} c_k(u_1, \ldots, u_{k-1})\,/\,T^{k/2},$$

which tends to $\sum_u c_2(u)$ for $k = 2$ and to 0 for $k > 2$,
following (5), giving the limit normal distribution.

Another aspect of the use of cumulants is that a calculus exists for
manipulating polynomials in basic variates. Suppose that

$$Y = g(X_1, \ldots, X_L) = \sum a_{j_1 \ldots j_L} X_1^{j_1} \cdots X_L^{j_L}. \tag{6}$$

One has directly from (6) that

$$E\{Y\} = \sum a_{j_1 \ldots j_L} E\{X_1^{j_1} \cdots X_L^{j_L}\},$$

but perhaps more usefully, there are rules due to Fisher, see Leonov and
Shiryaev (1959), Speed (1983), providing an expression

$$\mathrm{cum}_k\{Y\} = \sum_{\sigma} \gamma_{\sigma}\, \mathrm{cum}\{X_j : j \in \sigma_1\} \cdots \mathrm{cum}\{X_j : j \in \sigma_p\},$$

where $\sigma = (\sigma_1, \ldots, \sigma_p)$ is a partition of subscripts into blocks and the
$\gamma_{\sigma}$ are coefficients.

A time series analog of an expansion like (6) for ordinary variates is
provided by the Volterra expansion

$$Y(t) = a_0 + \sum_{u} a_1(t-u)X(u) + \sum_{u_1, u_2} a_2(t-u_1, t-u_2)X(u_1)X(u_2) + \cdots \tag{7}$$

Using the Cramér representation of the process, namely

$$X(t) = \int e^{it\lambda}\, dZ_X(\lambda),$$

(7) may be written

$$a_0 + \int e^{it\lambda} A_1(\lambda)\, dZ_X(\lambda) + \iint e^{it(\lambda_1+\lambda_2)} A_2(\lambda_1, \lambda_2)\, dZ_X(\lambda_1)\, dZ_X(\lambda_2) + \cdots$$

in terms of the Fourier transforms of the $a_1(\cdot), a_2(\cdot), \ldots$. This form
often simplifies the development of particular analytic results.
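A truncated version of the Volterra expansion is easy to evaluate directly in the time domain. The sketch below keeps only the constant, linear and quadratic terms, and assumes causal kernels of finite length indexed by lag; these restrictions, the function name `volterra2` and the sample values are illustrative assumptions.

```python
def volterra2(x, a0, a1, a2):
    """Second-order Volterra filter with finite causal kernels.

    y(t) = a0 + sum_i a1[i] x(t-i) + sum_{i,j} a2[i][j] x(t-i) x(t-j),
    where i and j are lags; samples before the start of x are treated as absent.
    """
    y = []
    for t in range(len(x)):
        linear = sum(a1[i] * x[t - i] for i in range(len(a1)) if t - i >= 0)
        quadratic = sum(a2[i][j] * x[t - i] * x[t - j]
                        for i in range(len(a2))
                        for j in range(len(a2[i]))
                        if t - i >= 0 and t - j >= 0)
        y.append(a0 + linear + quadratic)
    return y


output = volterra2([1.0, 2.0, 3.0], a0=0.5, a1=[1.0, 0.5],
                   a2=[[0.1, 0.0], [0.0, 0.2]])
```

Setting `a2` to all zeros recovers an ordinary linear filter, which makes the quadratic kernel's contribution easy to isolate when experimenting.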

Consideration now turns to the use of moments and cumulants in the
identification of nonlinear systems. In the case of a polynomial system
like (7), Lee and Schetzen (1965) develop estimates of the functions
$a_1(\cdot), a_2(\cdot), \ldots$ via empirical moments of the form

$$\frac{1}{T} \sum_{t=0}^{T-1} X(t+u_1) \cdots X(t+u_k)\,Y(t)$$

for the case that the input, $X(\cdot)$, is Gaussian white noise.

For the case of stationary Gaussian input and a quadratic system

$$Y(t) = a_0 + \sum_{u} a_1(t-u)X(u) + \sum_{u_1, u_2} a_2(t-u_1, t-u_2)X(u_1)X(u_2) + \text{noise},$$

Tick (1961) developed an estimation procedure as follows. Define the
cross-spectrum and cross-bispectrum via

$$\mathrm{cum}\{dZ_X(\lambda), dZ_Y(\mu)\} = \delta(\lambda+\mu)\, f_{XY}(\lambda)\, d\lambda\, d\mu$$

$$\mathrm{cum}\{dZ_X(\lambda_1), dZ_X(\lambda_2), dZ_Y(\lambda_3)\} = \delta(\lambda_1+\lambda_2+\lambda_3)\, f_{XXY}(\lambda_1, \lambda_2)\, d\lambda_1\, d\lambda_2\, d\lambda_3$$

respectively. One has

$$f_{YX}(\lambda) = A_1(\lambda)\, f_{XX}(\lambda)$$

$$f_{XXY}(-\lambda_1, -\lambda_2) = 2 A_2(-\lambda_1, -\lambda_2)\, f_{XX}(\lambda_1)\, f_{XX}(\lambda_2),$$

relations from which estimates of the transfer functions, $A$, may be
developed, based on estimates of the spectra that appear.

Another system that may be identified, in a like manner, takes the
form, for input $X(\cdot)$ and output $Y(\cdot)$,

$$U(t) = \sum_{u} a(t-u)X(u)$$

$$V(t) = G[U(t)]$$

$$Y(t) = \mu + \sum_{u} b(t-u)V(u) + \text{noise},$$

i.e. involves an instantaneous nonlinearity, $G[\cdot]$, and two linear filters. In
the case that $X(\cdot)$ is stationary Gaussian, one can develop the relationships

$$f_{YX}(\lambda) = L_1 A(\lambda) B(\lambda)\, f_{XX}(\lambda)$$

$$f_{XXY}(-\lambda_1, -\lambda_2) = L_2 A(-\lambda_1) A(-\lambda_2) B(-\lambda_1-\lambda_2)\, f_{XX}(\lambda_1)\, f_{XX}(\lambda_2),$$

where $L_1$, $L_2$ are constants. See Korenberg (1973) and Brillinger (1977).
Estimates of the identifiable unknowns may be developed based on estimates
of the spectra appearing.

4. Point process case. Consider isolated points, $\tau_k$, scattered along
the real line. Let $N(t)$ count the number in $(0, t]$ and $dN(t)$ the number in
the small interval $(t, t+dt]$. Typically $dN(t)$ will be 0 or 1.

The k-th order product density of the point process $N(\cdot)$ is $p_k(\cdot)$,
given by

$$E\{dN(t_1) \cdots dN(t_k)\} = \mathrm{Prob}\{dN(t_1) = 1, \ldots, dN(t_k) = 1\} = p_k(t_1, \ldots, t_k)\, dt_1 \cdots dt_k$$

for $t_1, \ldots, t_k$ distinct and $k = 1, 2, \ldots$. This relates to the moments
of the process as follows. Write $N^{(k)} = N(N-1) \cdots (N-k+1)$; then the
k-th factorial moment of $N(t)$ is

$$E\{N(t)^{(k)}\} = \int_0^t \cdots \int_0^t p_k(t_1, \ldots, t_k)\, dt_1 \cdots dt_k.$$

The corresponding cumulant density is given by

$$\mathrm{cum}\{dN(t_1), \ldots, dN(t_k)\} = q_k(t_1, \ldots, t_k)\, dt_1 \cdots dt_k$$

for $t_1, \ldots, t_k$ distinct. The k-th factorial cumulant of $N(t)$ is now

$$\int_0^t \cdots \int_0^t q_k(t_1, \ldots, t_k)\, dt_1 \cdots dt_k.$$

In the case of a Poisson process, the product densities will be given by

$$p_k(t_1, \ldots, t_k) = p(t_1) \cdots p(t_k)$$

with $p(t)$ the intensity of the process, and the cumulant densities will vanish
for $k > 1$.

As an example of the use of moments to derive an alternate limit
theorem, suppose one has $N_1(\cdot), \ldots, N_n(\cdot)$, i.i.d. copies of a point process
$N(\cdot)$. Suppose they are superposed and rescaled to form the point process

$$M_n(t) = N_1(t/n) + \cdots + N_n(t/n).$$

The k-th factorial cumulant of this process is

$$\int_0^{t/n} \cdots \int_0^{t/n} n\, q_k(t_1, \ldots, t_k)\, dt_1 \cdots dt_k \approx n \left(\frac{t}{n}\right)^k q_k(0, \ldots, 0)$$

for large $n$, assuming continuity at 0. This cumulant tends to $t\,q_1(0)$ for
$k = 1$ and to 0 for $k > 1$, and in consequence one has a Poisson limit for
the variate $M_n(t)$.

5. Extensions. The preceding results and definitions extend quite
directly to the cases of: a spatial process $X(x, y)$, a marked point process
$\sum_j M_j\, \delta(t - \tau_j)$, a hybrid process $X(\tau_j)$ and a line process, for example.

6. An example. In this section second-order moments and cumulants
are employed to infer the causal connections amongst some contemporaneous
point processes.

Consider the stationary bivariate point process (M, N) with points $\tau_j$
and $\gamma_l$ respectively. In what follows an estimate of the product density of
order 2 will be needed. The parameter is defined via

$$p_{MN}(u)\, du\, dt = E\{dM(t+u)\, dN(t)\} = \mathrm{Prob}\{dM(t+u) = 1,\ dN(t) = 1\}.$$

This last suggests basing an estimate on the count

$$\#\{\, |\tau_j - \gamma_l - u| < h/2 \,\} \tag{8}$$

for some small binwidth $h$. Details are given in Brillinger (1976). One
result is that it appears more pertinent to graph the square root of the
estimate. In the case that the processes M and N are independent, one will
have $p_{MN}(u) = p_M\, p_N$, which possibility may be examined via the statistic
(8).
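The count (8) is straightforward to compute from two lists of event times. The sketch below assumes the points of M are $\tau_j$ and those of N are $\gamma_l$, and it divides the count by $hT$ to obtain an estimate of the product density; that normalization and the function names are assumptions for illustration, with the details deferred in the text to Brillinger (1976).

```python
def pair_count(m_times, n_times, u, h):
    """Number of pairs (tau_j, gamma_l) with |tau_j - gamma_l - u| < h/2."""
    return sum(1 for tau in m_times for gam in n_times
               if abs(tau - gam - u) < h / 2)


def cross_intensity_estimate(m_times, n_times, u, h, T):
    """Count divided by h*T (assumed normalization), an estimate of p_MN(u)
    for processes observed on (0, T]."""
    return pair_count(m_times, n_times, u, h) / (h * T)


m_times = [1.0, 2.0, 5.0]   # illustrative event times of M
n_times = [0.5, 1.5, 4.0]   # illustrative event times of N
count = pair_count(m_times, n_times, u=0.5, h=0.2)
estimate = cross_intensity_estimate(m_times, n_times, u=0.5, h=0.2, T=10.0)
```

The double loop costs time proportional to the product of the two event counts; for long records one would sort the times and scan a window instead.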


The suggested estimate will be illustrated with some neurophysiological
data. Concern in the experiment was with auditory paths in the brain
of the cat. To collect data, microelectrodes were inserted with location
tuned to sound response. Data were recorded when the neurons were firing
spontaneously. Also, responses were evoked experimentally by 200 msec
noise bursts that were applied every 1000 msec via speakers inserted in
the ears. The firing times of 8 neurons were recorded. Figure 1 provides
the data itself for 4 selected cells, 2 in the case with stimulation, 2 when
the firing is spontaneous. Each horizontal line plots firings as a function
of time since stimulus initiation in a 1000 msec time period. The
stimulus was applied 505 times in these examples. In the stimulated case
one notices vertical darkening corresponding to excess firing just after the
stimulus has been applied. Neurophysiologists speak of locking. In the
spontaneous case no locking is apparent. There is some evidence of
nonstationarity in this case.

Figure  2  provides  the  square  root  of  a  multiple  of  (8).  The  horizontal 
dashed  lines  are  ±2  standard  errors  about  a  horizontal  line  corresponding 
to  independence  in  the  stationary  case.  One  infers  that  the  cell  pairs  are 
associated  in  each  case.  However  in  the  stimulated  case  one  has  to  wonder 
if  the  apparent  association  of  units  6  and  7  is  not  due  to  the  fact  that  the 
cells  are  being  stimulated  at  the  same  times. 

Fourier techniques provide one means to address this concern. Write

$$d_N^T(\lambda) = \sum_{l} e^{-i\lambda\gamma_l}$$

for the data $0 < \gamma_l < T$, with $d_M^T(\lambda)$ defined similarly. For $\lambda \neq 0$ one has

$$E\{d_M^T(\lambda)\, \overline{d_N^T(\lambda)}\} \approx 2\pi T\, f_{MN}(\lambda)$$

with $f_{MN}(\cdot)$ the cross-spectrum given by

$$f_{MN}(\lambda) = \frac{1}{2\pi} \int e^{-i\lambda u}\, q_{MN}(u)\, du.$$

A useful quantity for measuring the association of M and N may now be
defined. It is the coherence,

$$|R_{MN}(\lambda)|^2 = |f_{MN}(\lambda)|^2 \,/\, f_{MM}(\lambda)\, f_{NN}(\lambda),$$

with the interpretation

$$\lim_{T \to \infty} |\mathrm{corr}\{d_M^T(\lambda),\ d_N^T(\lambda)\}|^2.$$

It satisfies $0 \le |R_{MN}(\lambda)|^2 \le 1$, with greater association corresponding to
values nearer 1. Figure 3 provides coherence estimates for the cell pairs
of Figure 2. This evidence of association is in accord with that of Figure
2. The dashed horizontal line provides the 95% point of the null distribution
of the coherence estimate.

To return to the driving question of how to "remove" the effects of
the stimulus, one can consider the partial coherence. This has the
interpretation

$$\lim_{T \to \infty} |\mathrm{corr}\{d_M^T - \alpha\, d_S^T,\ d_N^T - \beta\, d_S^T\}|^2$$

with $\alpha$, $\beta$ regression coefficients and S referring to the process of stimulus
times. Suppressing the dependence on $\lambda$, the partial coherence is given by
$|R_{MN|S}|^2$, where

$$R_{MN|S} = \frac{R_{MN} - R_{MS} R_{SN}}{\sqrt{(1 - |R_{MS}|^2)(1 - |R_{SN}|^2)}}.$$

Figure 4 provides the estimated partial coherence of neurons 6 and 7 in
the stimulated case. The level apparent in the top graph of Figure 3 has
fallen off substantially, suggesting that the association evidenced in Figures
2 and 3 is due to the stimulus.
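Given spectral estimates at a single frequency, the coherency and partial coherency are one-line computations. The sketch below takes complex coherencies as inputs and applies the partial coherency formula $(R_{MN} - R_{MS}R_{SN})/\sqrt{(1-|R_{MS}|^2)(1-|R_{SN}|^2)}$; the function names and the numerical values are hypothetical.

```python
import math


def coherency(f_mn, f_mm, f_nn):
    """Complex coherency R_MN = f_MN / sqrt(f_MM * f_NN) at one frequency."""
    return f_mn / math.sqrt(f_mm * f_nn)


def partial_coherency(r_mn, r_ms, r_sn):
    """Coherency of M and N after linear removal of the effect of S."""
    denom = math.sqrt((1 - abs(r_ms) ** 2) * (1 - abs(r_sn) ** 2))
    return (r_mn - r_ms * r_sn) / denom


# If M and N are associated only through S, R_MN = R_MS * R_SN and the
# partial coherence vanishes even though the ordinary coherence does not.
r_ms = 0.8
r_sn = 0.5 + 0.5j
r_mn = r_ms * r_sn
ordinary = abs(r_mn) ** 2
partial = abs(partial_coherency(r_mn, r_ms, r_sn)) ** 2
```

This mirrors the situation described for neurons 6 and 7: a clear ordinary coherence that largely disappears once the stimulus process is accounted for.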

For interest's sake, Figure 5 provides the coherence estimate for neurons
3 and 4 in the case of applied stimulation. One might wonder if they
would become more strongly associated in the presence of stimulation.
The results do not suggest that this has happened.

7. Conclusions. In summary, moments and cumulants may be
employed to develop approximations to distributions, approximations such
as the normal or the Poisson. They may be employed in system
identification. They may be used to infer the "wiring" diagram of a
collection of interacting point processes.

The approach presented is nonparametric, not based on special stochastic
processes described by finite dimensional parameters. Brillinger
(1991) provides a variety of references concerning the pre-1980 work on
higher moments and spectra.

Acknowledgements. The neurophysiological data were provided by
Alessandro Villa. Terry Speed mentioned the Streitberg (1990) result.

Figure Legends

Figure 1. Raster plot of the firing times of 4 neurons in successive 1000
msec periods. There are 505 horizontal lines of firing times.

Figure 2. The square root of a multiple of the quantity (8). Were the
processes independent and stationary, then about 5% of the values should
lie outside the band defined by the two horizontal dashed lines.

Figure 3. Estimated coherences of cells 6 and 7 in the stimulated case and
3 and 4 when the firing is spontaneous.

Figure 4. Estimated partial coherence of cells 6 and 7 "removing" the
effect of the stimulus.

Figure 5. The estimated coherence of cells 3 and 4 in the case of stimulation.


[Figure panels: "Spike times following stimulus"; "Stimulated unit 6", "Stimulated unit 7",
"Spontaneous unit 3", "Spontaneous unit 4"; horizontal axes: lag (msec).]


Moment-based  oscillation 
properties  of  mixture  models 


Bruce  Lindsay 
Department  of  Statistics 
Pennsylvania  State  University 
and 

Kathryn  Roeder 
’Department  of  Statistics 
Yale  University 


Abstract 

Consider finite mixture models of the form $g(x; Q) = \int f(x; \theta)\, dQ(\theta)$,
where $f$ is a parametric density and $Q$ is a discrete probability measure.
An important and difficult statistical problem concerns the determination
of the number of support points (usually known as components)
of $Q$ from a sample of observations from $g$. For an important
class of exponential family models we have the following result: if $P$
has more than $p$ components, and $Q$ is an appropriately chosen $p$
component approximation of $P$, then $g(x; P) - g(x; Q)$ demonstrates a
prescribed sign change behavior, as does the corresponding difference
in the distribution functions. These strong structural properties have
implications for diagnostic plots for the number of components in a
finite mixture.


1  Introduction 

*The authors were supported by NSF grants DMS-9106895 to Lindsay and DMS-[number illegible] to Roeder.

Consider a family of univariate probability densities $f(x; \theta)$, with respect to
some $\sigma$-finite measure $d\gamma(x)$, parameterized by $\theta \in \Omega$. Frequently, interest
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lies  in  mixtures  of  such  densities.  The  random  variable  X  is  said  to  have  a 
mixture distribution $G(\cdot\,; Q)$ if $X$ has density

$$g(x; Q) = \int f(x; \theta)\, dQ(\theta), \tag{1}$$

and the mixing distribution $Q$ is a probability measure on $\Omega$. If $Q$ has a
finite number of support points $\nu = \nu(Q)$ then we say $Q$ is a finite mixing
distribution and we write $Q_\nu = \sum_j \pi_j \delta(\theta_j)$ with $\theta_1, \ldots, \theta_\nu$ being the support
points (often called components) and $\pi_j$ being the weights.

A problem of longstanding interest in such models is inference on the
unknown value of $\nu$. At the simplest level, this is the problem of determining
if $\nu$ equals 1, the one component model, or is greater than 1, the
multicomponent model. Shaked (1980) presented important results for this
problem when the component densities $f(x; \theta)$ are one parameter exponential
family. We extend his results in two directions, generalizing to the discrimination
between $\nu = p$ versus $\nu > p$, and moving beyond the one parameter
exponential family to the normal mixture model in which each component
has a different mean, but the same unknown variance.

Here we summarize Shaked's sign crossing results. Suppose we wish to
contrast a multicomponent model $g(x; Q)$ with a plausible one component
model $f(x; \theta)$. Choose $\theta = \theta'$ for the one component model so that the
observed variable $X$ has the same mean under both densities:

$$\int x\, g(x; Q)\, d\gamma(x) = \int x\, f(x; \theta')\, d\gamma(x).$$

Our notation for this last equation will be $E[X; Q] = E[X; \theta']$. Shaked
showed that $g(x; Q) - f(x; \theta')$ has exactly two sign changes, in the order
$(+, -, +)$, as $x$ traverses the sample space. That is, $g(x; Q)$ has heavier tails
than $f(x; \theta')$. Moreover, the difference in distribution functions $G(x; Q) -
F(x; \theta')$ has exactly one sign change, in the order $(+, -)$.

We extend his results as follows: let $P$, the nominal true mixing distribution,
satisfy $\nu(P) > p$; choose $Q_p$, a candidate $p$-point probability measure,
such that it satisfies

$$E[X^k; P] = E[X^k; Q_p], \quad k = 0, 1, \ldots, 2p - 1. \tag{2}$$

(In Section 2 we show how to solve for $Q_p$.) Then, in Theorem 3.2, we show
that $g(x; P) - g(x; Q_p)$ has exactly $2p$ sign changes in the order $(+, -, \ldots, +)$,
unless it is identically zero (the case of nonidentifiable $P$). An exact sign
change result for the difference in distribution functions is also given in Section 3.
In Section 4, these results are extended to normal densities with
unknown variance.

Before proceeding to the mathematical verification of these results, we
offer a few brief comments on their potential application. In Figure 1, we
plot $[g(x; P) - g(x; Q_2)] / \sqrt{g(x; P)}$ for the case when $f(x; \theta)$ is Poisson, $P$
puts mass 1/3 each at (1, 3 and 5), and $Q_2$ is constructed to match moments
as specified in (2). We note the clear trimodality of this function, in contrast
to the unimodality of the density $g(x; P)$ (Figure 2).

Shaked demonstrated that his sign change results could be used for diagnostic
checks to determine if the data were from a mixture of specified
exponential family densities rather than a one component model. These
ideas were further developed in Lindsay and Roeder (1992). When interest
lies in assessing the number of components in a finite mixture, the oscillation
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results  obtained  in  this  article  have  clear  implications  for  diagnostics  plots. 
In  a  companion  paper  these  results  are  used  to  develop  diagnostic  plots  for 
the  case  of  normal  mean  mixtures  with  unknown  variance  (Roeder  1992). 

2  Background 

2.1  The  models  under  investigation 

We will be interested in component densities $f(x; \theta)$ where both $x$ and $\theta$ have
ranges in the real numbers, say $x \in T \subset \mathbb{R}$ and $\theta \in [l, u] \subset \mathbb{R}$. Furthermore,
$f(\cdot\,; \cdot)$ satisfies regularity conditions which will be expounded in this subsection.
Although the most important application of the results to follow is the
one parameter exponential family, the results readily extend to other cases
of interest for which we need the following terminology.

A real function of two variables, $K(x, \theta)$, ranging over linearly ordered sets
$T$ and $\Omega$ is said to be totally positive (TP) if certain determinantal inequalities
hold (Karlin 1968, p. 11, 15). For instance, the functions $\exp(\theta x)$ and $I(x <
\theta)$ are TP. In addition, many density functions occurring in statistical theory
are TP. One example is the one parameter exponential family with density
function $K(x, \theta) = \exp\{\theta x - \kappa(\theta)\}$. Other examples include the noncentral-$t$
and noncentral-$\chi^2$ densities. In fact, all of the densities mentioned above are
strictly TP (STP; Karlin 1968, p. 12). For a more extensive list, see Karlin
(1968, p. 117). We will say that $f(x; \theta)$ is an STP-model if $f(x; \theta)$ is strictly
totally positive in $x$ and $\theta$.




2.2 Background on moments and exponential families

In order to apply our results in a particular model we need to establish two
important structural features for the component densities $f(x; \theta)$. Our first
requirement is as follows: suppose that $P$ is a mixing distribution with $p$
or more support points. Then we need to be able to construct a $p$-point
distribution $Q_p$ such that the first $2p - 1$ moments of $g(x; P)$ and $g(x; Q_p)$
match, satisfying (2). Fortunately, there exists an important class of exponential
families (the quadratic variance class) in which $Q_p$ satisfying (2) can
be shown to exist. This class includes the normal, gamma, Poisson and binomial
distributions. The following is a brief review of techniques found in
Lindsay (1989).

In the quadratic variance family of exponential family models (Morris
1983), for each $k$, there exists a polynomial of degree $k$, call it $\xi_k(x)$, such
that

$$\int \xi_k(x)\, f(x; \theta)\, d\gamma(x) = (\mu - \mu_0)^k \tag{3}$$

for mean value parameter $\mu$. The choice of $\mu_0$ is arbitrary so we set it to
zero. For example, in the Poisson with mean $\mu$, $E[X] = \mu$, $E[X(X - 1)] =
\mu^2$, $E[X(X - 1)(X - 2)] = \mu^3$ and so forth. In addition, a classical moment
result indicates that for a given distribution $P$ with no fewer than $p$ points
of support, there exists a unique distribution $Q_p$ with exactly $p$ points of
support such that

$$\int \mu^k\, dQ_p(\mu) = \int \mu^k\, dP(\mu), \quad k = 1, \ldots, 2p - 1. \tag{4}$$

Thus integrating both sides of (3) with respect to $dQ_p(\mu)$ and $dP(\mu)$, and
using (4) yields

$$E[\xi_k(X); P] = E[\xi_k(X); Q_p], \quad k = 1, \ldots, 2p - 1. \tag{5}$$

Finally, the map taking $(1, x, \ldots, x^{2p-1}) \to (\xi_0(x), \xi_1(x), \ldots, \xi_{2p-1}(x))$ is invertible,
so (5) implies (2).

More  details  on  solving  (5)  for  Qp  are  given  in  Lindsay  (1989).  The  solu¬ 
tions  can  be  obtained  algebraically  for  p  =  2.  For  arbitrary  p,  the  problem 
involves  solving  a  degree  p  polynomial  for  its  p  real  roots. 
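
The construction of $Q_p$ from the moments in (4) can be sketched numerically. The Python sketch below is not taken from Lindsay (1989); it assumes the standard Hankel-system route to a $p$-point distribution with prescribed moments, and the function name `match_moments` and the example mixing distribution are illustrative only. It solves a linear system for the coefficients of a degree $p$ polynomial, takes the roots of that polynomial as the support points, and recovers the masses from the first $p$ moments.

```python
import numpy as np

def match_moments(support, weights, p):
    """Return (points, masses) of a p-point distribution whose moments of
    order 0, ..., 2p-1 agree with those of the discrete distribution P
    given by `support` and `weights` (P needs at least p support points)."""
    support = np.asarray(support, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Moments m_0, ..., m_{2p-1} of P.
    m = np.array([np.sum(weights * support**k) for k in range(2 * p)])
    # Monic polynomial x^p + c_{p-1} x^{p-1} + ... + c_0 orthogonal to
    # 1, x, ..., x^{p-1} under P: sum_j c_j m_{i+j} = -m_{i+p}, i = 0..p-1.
    hankel = np.array([[m[i + j] for j in range(p)] for i in range(p)])
    c = np.linalg.solve(hankel, -m[p:2 * p])
    # np.roots expects the highest-degree coefficient first.
    points = np.sort(np.roots(np.concatenate(([1.0], c[::-1]))).real)
    # Masses solve the Vandermonde system sum_l w_l * points_l^k = m_k.
    vander = np.vander(points, p, increasing=True).T
    masses = np.linalg.solve(vander, m[:p])
    return points, masses

# Illustrative mixing distribution: mass 1/3 at each of 1, 3 and 5.
points, masses = match_moments([1.0, 3.0, 5.0], [1/3, 1/3, 1/3], p=2)
print(points, masses)
```

For this symmetric example the answer is also available in closed form: the two atoms sit at $3 \pm \sqrt{8/3}$ with mass 1/2 each, which gives a quick check on the numerical solution.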

3  One  parameter  models 

In this section we obtain sign change results for one parameter models. The
following notation (Karlin 1968, p. 20) will be used. Let $a(x)$ be defined on
$I$ where $I$ is a subset of the real line. The number of sign changes of $a$ in $I$
is defined by

$$S^-(a) = \sup S^-[a(x_1), \ldots, a(x_m)] \tag{6}$$

where $S^-(y_1, \ldots, y_m)$ is the number of sign changes of the indicated sequence,
zero terms being discarded, and the supremum is extended over all sets

$$x_1 < x_2 < \cdots < x_m \quad (x_i \in I); \quad m < \infty. \tag{7}$$

We assume throughout that $f(x; \theta)$ is an STP kernel and that $P$ and
$Q_p$ satisfy (2). The following notation will be used throughout this section:
$g_1 = g(x; P)$, $g_2 = g(x; Q_p)$, $G_1 = G(x; P)$ and $G_2 = G(x; Q_p)$.

Remark In the following result we will give exact sign change results for
$g_1 - g_2$ with the proviso "the difference $g_1 - g_2$ is not identically zero". If
such an equality in densities occurs, it is clear that there is an identifiability
problem: both $P$ and $Q_p$ are generating the same distribution. The results of
Lindsay and Roeder (1992) can be used to determine exactly when this will
occur. If the sample space is infinite, it will not occur. If the sample space has
$N$ points, then $p$-point distributions $Q_p$ are identifiable when $p < (N - 1)/2$,
and so $g_1 - g_2$ cannot be identically zero. If both $P$ and $Q_p$ have more than
$(N - 1)/2$ points, then $g_1 - g_2$ cannot have exactly $2p$ sign changes, since we
can have at most $N - 1$ sign changes as we traverse the sample space. Thus
our result proves that $P$ and $Q_p$ generate the same density. ■


Lemma 3.1 Provided $g_1 - g_2$ is not identically zero, $S^-(g_1 - g_2) \le 2p$.


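Proof Define the measure $d\lambda(\theta)$ by

$$d\lambda(\theta) = d(P + Q_p)(\theta).$$

Let $\theta_1, \ldots, \theta_p$ denote the support points of $Q_p$, and let

$$p^*(\theta) = \begin{cases} P(\{\theta\}) / \left(P(\{\theta\}) + Q_p(\{\theta\})\right) & \text{if } \theta \in \{\theta_1, \ldots, \theta_p\} \\ 1 & \text{else,} \end{cases}$$

and

$$q^*(\theta) = \begin{cases} Q_p(\{\theta\}) / \left(P(\{\theta\}) + Q_p(\{\theta\})\right) & \text{if } \theta \in \{\theta_1, \ldots, \theta_p\} \\ 0 & \text{else.} \end{cases}$$

Then $p^*$ and $q^*$ are versions of the Radon-Nikodym derivatives $dP/d\lambda$ and
$dQ_p/d\lambda$, so that $g_1 - g_2 = \int f(x; \theta)\,[p^*(\theta) - q^*(\theta)]\, d(P + Q_p)(\theta)$.

We now apply Theorem 3.1 (b) of Karlin (1968), noting that $p^*(\theta) - q^*(\theta)$
equals one except possibly at the support of $Q_p$, where it can be negative.
Hence it undergoes a maximum of $2p$ sign changes. Karlin's result then
implies that integration with respect to the STP kernel $f(x; \theta)$ will result in
a function, $g_1 - g_2$, with no more sign changes in $x$ than $p^*(\theta) - q^*(\theta)$ has
in $\theta$ relative to $d\lambda$. This establishes an upper bound of $2p$ sign changes in
$g_1 - g_2$. ■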

Theorem 3.2 Provided $g_1 - g_2$ is not identically zero, $S^-(g_1 - g_2) = 2p$,
with sign changes in the order $(+, -, \ldots, +)$.

Proof From Lemma 3.1, we obtain an upper bound on the number of sign
changes of $2p$. Because $\int x^k (g_1 - g_2)(x)\, d\gamma(x) = 0$ for $k = 0, 1, \ldots, 2p - 1$, any
polynomial $A(x)$ of degree $\le 2p - 1$ satisfies

$$\int A(x)\,(g_1 - g_2)(x)\, d\gamma(x) = 0.$$

Suppose $S^-(g_1 - g_2) \le 2p - 1$. Then we can construct a polynomial $A(x)$
that matches $g_1 - g_2$ in sign (i.e., it has single roots exactly at the roots of
$g_1 - g_2$). It follows that $A(x)(g_1 - g_2) \ge 0$, and since it has zero integral
it must be zero except for a set of $\gamma$-measure zero. Hence either $g_1 = g_2$, or
$g_1 - g_2$ has $2p$ sign changes. ■

Remark As is clear from the proof for this result, our oscillation results still
hold if we replace $x^k$ in (2) with any system of functions $a_k(x)$, such as $x^k e^{-x}$,
provided that one can construct a function $A(x) = \sum \alpha_k a_k(x)$ which has
any prespecified set of $2p - 1$ zeroes. Such an approach could be useful in
improving on the robustness of the sample moments in applications by using
bounded variables such as $a_k(x) = x^k e^{-x}$. The next theorem, however, uses
the special form of $x^k$. ■

Theorem 3.3 Provided $G_1 - G_2$ is not identically zero, $S^-(G_1 - G_2) = 2p - 1$,
with sign changes in the order $(+, -, \ldots, -)$. The roots occur between the
roots of $g_1 - g_2$.


Proof An upper bound is obtained on the number of sign changes by appealing
to the sign change behavior of $g_1 - g_2$. The function $G_1 - G_2$ is
increasing on the intervals $[a, b]$ where $g_1 - g_2 > 0$:

$$G_1(b) - G_2(b) - (G_1(a) - G_2(a)) = \int I[a < x \le b]\,(g_1 - g_2)(x)\, d\gamma(x) > 0.$$

From this it follows that $G_1 - G_2$ has at most one crossing in each interval
where $g_1 - g_2$ is constant in sign, but has none in the first or last interval.
Hence $S^-(G_1 - G_2) \le 2p - 1$. Integration by parts gives

$$0 = \int x\, d(G_1 - G_2)(x) = \int [G_2 - G_1](x)\, dx,$$

and more generally

$$0 = \int x^k\, d(G_1 - G_2)(x) = \int x^{k-1} [G_2 - G_1](x)\, dx,$$

up to $k = 2p - 1$. Now, follow the proof of Theorem 3.2. If $G_2 - G_1$ had
$2p - 2$ or fewer sign changes, a polynomial $A(x)$ of degree $2p - 2$ could be
constructed with matching signs. Hence $A(x)[G_2 - G_1](x) \ge 0$, but has zero
integral. The result follows. ■
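
Theorems 3.2 and 3.3 can be checked numerically on the Poisson example from the Introduction, where $P$ puts mass 1/3 at each of 1, 3 and 5 and $p = 2$. The sketch below is an illustration, not part of the original paper. It assumes that the moment-matching $Q_2$ has atoms $3 \pm \sqrt{8/3}$ with mass 1/2 each (this follows from the symmetry of $P$ about 3 and its variance 8/3), and the helper names `poisson_mixture_pmf` and `sign_pattern` are hypothetical. The grid stops at $x = 20$ so that the tail differences stay well above floating point rounding error.

```python
import numpy as np
from math import lgamma, sqrt

def poisson_mixture_pmf(x, means, weights):
    """pmf of a finite Poisson mixture evaluated at the integer points x."""
    x = np.asarray(x, dtype=float)
    means = np.asarray(means, dtype=float)[:, None]
    log_fact = np.array([lgamma(v + 1.0) for v in x])
    comp = np.exp(-means + x * np.log(means) - log_fact)  # one row per component
    return np.asarray(weights, dtype=float) @ comp

def sign_pattern(values):
    """Signs of `values` with zeros discarded and repeats collapsed, so
    len(result) - 1 is the number of sign changes S^- of the sequence."""
    s = np.sign(values)
    s = s[s != 0].astype(int)
    if s.size == 0:
        return []
    keep = np.concatenate(([True], s[1:] != s[:-1]))
    return s[keep].tolist()

x = np.arange(0, 21)
half_width = sqrt(8.0 / 3.0)
g1 = poisson_mixture_pmf(x, [1.0, 3.0, 5.0], [1/3, 1/3, 1/3])
g2 = poisson_mixture_pmf(x, [3.0 - half_width, 3.0 + half_width], [0.5, 0.5])

density_signs = sign_pattern(g1 - g2)                     # Theorem 3.2: 2p changes
cdf_signs = sign_pattern(np.cumsum(g1) - np.cumsum(g2))   # Theorem 3.3: 2p - 1 changes
print(density_signs, cdf_signs)
```

With $p = 2$ the density difference should show four sign changes beginning and ending with a plus sign, and the distribution function difference three sign changes beginning with a plus sign.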


For continuous $X$, a diagnostic plot based on a nonparametric empirical
analog of $G_1 - G_2$ can be constructed directly. Let $F_n$, the empirical distribution
function, be an estimate of the alleged distribution $G_1$ and let $\hat G_2$ be an
estimate of $G_2$ constructed by using the method of moments estimates of the
$p$-component model. Naturally, $F_n$ and $\hat G_2$ have $2p - 1$ moments in common.
It follows that if $F_n - \hat G_2$ has the appropriate sign change behavior, then
the data provide some support for using more than $p$ components. On the
other hand, if a $p$-point mixture is the correct model, then the asymptotic
properties of $F_n - \hat G_2$ can be obtained from empirical process theory.

4  Normal  Mean  Mixtures  with  Unspecified  Variance 

In this section we consider a mixture model of great interest, the normal
mean mixture. We use the following notation: let $f(x; \mu, \tau)$ denote the density
of a $N(\mu, \tau)$ random variable and let $g(x; Q, \tau) = \int f(x; \mu, \tau)\, dQ(\mu)$ denote
a mixture of normals with corresponding distribution function $G(x; Q, \tau)$.
If $\tau$ were known, then this would be just a special case of the previous section; however,
in practice, $\tau$ will typically be unknown and hence we treat it as a free
parameter. In this section we extend our results to this case. We first present
an existence theorem, due to Lindsay (1989), which extends the classic moment
results presented in Section 2 to normal mixtures.

Theorem 4.1 If $Q$ is a distribution with more than $p$ points, then there
exists a unique $p$-point distribution $Q_p$ and variance $\tau_p > \tau$ such that

$$\int x^k\, dG(x; Q_p, \tau_p) = \int x^k\, dG(x; Q, \tau) \quad \text{for } k = 0, 1, \ldots, 2p. \tag{8}$$

Proof While this is not explicitly stated in Lindsay (1989), it is a consequence
of Lemma 5A and Theorem 5C. In the latter, replace the empirical
moments with the moments of $X$ under $G(\cdot\,; Q, \tau)$. ■

Theorem 4.2 If $(Q_p, \tau_p)$ satisfies (8) for $Q = Q_{p+1}$, a $(p + 1)$-point distribution,
then

$$g(x; Q_{p+1}, \tau) - g(x; Q_p, \tau_p)$$

has exactly $2p + 2$ sign changes, occurring in the order $(-, +, \ldots, -)$.

Proof Since $\tau_p > \tau$, we can represent the above difference as

$$g(x; Q, \tau) - g(x; \tilde Q_p, \tau)$$

where $\tilde Q_p$ is the convolution of $Q_p$ with a normal distribution with mean zero
and variance $\tau_p - \tau$. By the same argument as in Lemma 3.1, this means there
are a maximum of $2p + 2$ sign changes. The polynomial argument used in
the proof of Theorem 3.2 can now be used together with (8) to show that
there are at least $2p + 1$ sign changes. Moreover, since $\tilde Q_p$ has more mass
in the tails than the discrete $Q_{p+1}$, the difference $g(x; Q, \tau) - g(x; \tilde Q_p, \tau)$ will
have a negative sign in both tails, and so must have an even number of sign
changes, hence $2p + 2$. ■

Theorem 4.3 $G(x; Q, \tau) - G(x; Q_p, \tau_p)$ has exactly $2p + 1$ sign changes, in
the order $(-, +, \ldots, +)$.

Proof  A  similar  argument  to  Theorem  3.3.  ■ 


Graphical techniques, such as the normal scores plot (Harding 1948,
Cassie 1954) and the modified percentile plot (Fowlkes 1979) have played
an important role in identifying whether data follow a mixture of two normal
distributions. The geometric characterizations obtained herein extend
the arsenal of potential diagnostic plots for normal mixtures.

5 Discussion


Our results above, in the normal case, indicate that

$$g(x; Q_2, \tau) - f(x; \mu, \sigma^2)$$

has 4 sign changes in the order $(-, +, -, +, -)$ provided $\mu$ is the mean of
$Q_2$ and $\sigma^2 = \mathrm{Var}(X) = \tau + \mathrm{Var}(Q_2)$. For this case a supplementary result
is available from Roeder (1992). If we instead examine the ratio $R(x) =
g(x; Q_2, \tau)/f(x; \mu, \sigma^2)$, we obtain a function proportional to a bimodal normal
mixture density. By combining the two results we can see that $R(x)$ is bimodal and
that both modes are greater than 1.

In the normal model, with $\pi_1 = \pi_2 = 1/2$, the density $g(x; Q_2, \tau)$ is
bimodal if and only if the two separate supports $\mu_1$ and $\mu_2$ satisfy $|\mu_1 -
\mu_2| > 2\sqrt{\tau}$ (Robertson and Fryer 1969). Thus the ratio function is much more
sensitive to the existence of two support points than is the density itself.
This sensitivity continues to exist even for very small support weights $\pi_j$.
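
The contrast between the density and the ratio function can be illustrated numerically. The following Python sketch is a hedged illustration under assumed values that do not come from the paper: $Q_2$ puts mass 1/2 at $\pm 0.9$ and $\tau = 1$, so $|\mu_1 - \mu_2| = 1.8 < 2\sqrt{\tau}$ and the mixture density itself is unimodal. The helper names are hypothetical.

```python
import numpy as np

def normal_pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def sign_pattern(values):
    """Signs with zeros discarded and consecutive repeats collapsed."""
    s = np.sign(values)
    s = s[s != 0].astype(int)
    if s.size == 0:
        return []
    keep = np.concatenate(([True], s[1:] != s[:-1]))
    return s[keep].tolist()

def interior_peaks(values):
    """Values at strict interior local maxima of a sampled function."""
    mid = values[1:-1]
    return mid[(mid > values[:-2]) & (mid > values[2:])]

tau = 1.0                                    # assumed common component variance
atoms = np.array([-0.9, 0.9])                # assumed support points of Q_2
weights = np.array([0.5, 0.5])
mu = weights @ atoms                         # mean of Q_2
sigma2 = tau + weights @ (atoms - mu) ** 2   # Var(X) = tau + Var(Q_2)

x = np.linspace(-6.0, 6.0, 2401)
g = sum(w * normal_pdf(x, m, tau) for w, m in zip(weights, atoms))
f = normal_pdf(x, mu, sigma2)

diff_signs = sign_pattern(g - f)      # sign pattern of g - f
density_peaks = interior_peaks(g)     # modes of the mixture density
ratio_peaks = interior_peaks(g / f)   # modes of R(x)
print(diff_signs, len(density_peaks), ratio_peaks)
```

Under these assumptions the difference shows the $(-, +, -, +, -)$ pattern, the density has a single mode, and the ratio has two modes, each above 1.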
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Abstract 


The  normal  distribution  has  long  been  the  usual  model  for  the  analysis  of  multivariate  data. 
Moment  and  probability  calculations  for  the  multivariate  normal  are  used  in  applications  such 
as  the  construction  of  confidence  sets,  the  assessment  of  error  rates  in  signal  processing,  and  the 
construction  of  optimal  quantizers.  Recently,  the  family  of  elliptically  contoured  distributions, 
which  includes  the  normal,  has  been  extensively  studied.  In  this  paper,  we  discuss  moment  and 
probability  calculations  for  this  broader  class,  paying  particular  attention  to  the  approximation 
of  tail  probabilities. 
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1  Introduction 


The  normal  distribution  has  long  been  the  usual  model  for  the  analysis  of  multivariate 
data.  Moment  and  probability  calculations  for  the  multivariate  normal  have  therefore  been  well 
studied  for  various  cases  of  interest.  In  statistics,  a  common  application  of  such  quantities  is 
the  construction  of  confidence  sets  for  parameters  of  the  normal  distribution.  Other  examples 
include the assessment of error rate probabilities in signal processing, the construction of optimal
quantizers  for  a  Gaussian  process,  and  the  computation  of  a  high  order  correlation  coefficient  of 
the  outputs  from  a  zero-memory  non-linear  device  with  Gaussian  inputs. 

The general problem is still intractable, owing to the great difficulty in evaluating high
dimensional integrals, but advances in computing technology and recent research have yielded
innovative Monte Carlo and numerical integration techniques. These advances have widened
the  scope  of  such  investigations  to  include  other  multivariate  distributions.  For  instance,  there 
are  the  elliptically  contoured  distributions  and  the  multivariate  Pearson  family  of  distributions, 
both  of  which  include  the  multivariate  normal.  Elliptically  contoured  distributions,  in  particular, 
have  been  extensively  developed:  see  the  collection  of  papers  about  them  that  was  recently  edited 
by  Anderson  and  Fang  [2]. 

In this paper, we study the computation of probabilities and moments for certain elliptically
contoured distributions, and discuss their applications. There are, of course, many classes of
events whose probabilities are of interest, and many functions whose expectations are of interest.
Our focus will be on the evaluation of tail probabilities, and on methods for computing product
moments, and other non-linear functions of the components of the random vector. In Section 2,
we introduce elliptically contoured distributions, and describe their properties. Historically, moment
methods have been associated with Pearson's family of distributions. Since some elliptically
contoured distributions are also natural multivariate versions of some of Pearson's distributions,
we briefly describe this connection also. In Section 3, we discuss applications of tail probabilities,
and describe methods for approximating them accurately. These methods include Monte
Carlo with importance sampling, and asymptotic approximations that generalize Mills' ratio for
the normal distribution. In Section 4, we turn to moment calculations for elliptically contoured
distributions using one of three tools: the characteristic function, a stochastic representation,
and a certain partial differential equation satisfied by sufficiently smooth elliptically contoured
densities.


2  Elliptically  Contoured  Distributions  and  Pearson  Families 


A $p$-dimensional vector $X$ has an elliptically contoured distribution if there is a non-negative
definite matrix $\Sigma = (\sigma_{ij})$ such that the characteristic function of $X$ is $e^{it'\mu}\psi(t'\Sigma t)$, where
$\psi$ is a real-valued function on $\mathbb{R}_+ = [0, \infty)$. Then $X$ has the stochastic representation

$$X = \mu + r\, \Sigma^{1/2} U_p, \tag{1}$$

where $\mu$ is the center of symmetry, the radial part $r$ is a non-negative random variable, and
$U_p$ is uniformly distributed on $\Omega_p$, the surface of the unit sphere in $p$ dimensions; $r$ and $U_p$ are
independent. The matrix $\Sigma^{1/2}$ is a square root of $\Sigma$; for computations, it is convenient to take
$\Sigma^{1/2}$ to be the lower triangular matrix from the Cholesky decomposition, or the non-negative
definite symmetric square root derived from the spectral representation of $\Sigma$. When $X$ has a
density $f$, it is of the form

$$f(x; \mu, \Sigma) = |\Sigma|^{-1/2} g(Q), \tag{2}$$

where $Q = Q(x; \mu, \Sigma) = (x - \mu)'\Sigma^{-1}(x - \mu)$, $g : \mathbb{R}_+ \to \mathbb{R}_+$,

$$a_p \int_0^\infty r^{p-1} g(r^2)\, dr = 1, \tag{3}$$

and $a_p$ is the area of $\Omega_p$; the level curves of $f$ are ellipses determined by $\{x : Q = c\}$. In this case,
$r$ has the density $h_r(r) = a_p r^{p-1} g(r^2)$. Examples of elliptically contoured distributions include
the normal, for which $g(r) = \psi(r) = e^{-r/2}$, and the $p$-variate $t$ distribution with $\nu$ degrees of
freedom, for which

$$f_\nu(x; \mu, \Sigma) \propto |\Sigma|^{-1/2}\, (1 + Q/\nu)^{-(p + \nu)/2}. \tag{4}$$


Another example is due to Iyengar [12] (see also [15]):

$$f_{p,k}(x; \mu, \Sigma, \eta) = \frac{\Gamma(p/2)}{\Gamma(k + p/2)}\, |\pi\eta\Sigma|^{-1/2}\, (Q/\eta)^k \exp(-Q/\eta), \tag{5}$$

where $\eta > 0$ and $k \ge 0$. When $k = 0$, (5) yields the normal distribution. For the bivariate
case, Kotz [20] has also studied this family. The uniform distribution on $\Omega_p$ is yet another
example which will be used for moment calculations below; it does not have a density. For
further discussion of elliptically contoured distributions, see Anderson and Fang [2], Das Gupta,
et al. [8], and Cambanis, et al. [5].
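
The stochastic representation (1) translates directly into a sampling recipe. The Python sketch below is illustrative only and makes several assumptions not stated in the text: $\Sigma$ is taken to be positive definite so that the Cholesky factor exists, the uniform vector $U_p$ is generated by normalizing a standard normal vector, and the radial part is chosen so that $r^2$ is chi-square with $p$ degrees of freedom, which corresponds to the multivariate normal case. The function names and numerical values are hypothetical.

```python
import numpy as np

def sample_elliptical(mu, sigma, radial_sampler, n, rng):
    """Draw n vectors X = mu + r * L @ U, where L is the lower triangular
    Cholesky factor of sigma, U is uniform on the unit sphere, and r is
    drawn independently of U by `radial_sampler`."""
    mu = np.asarray(mu, dtype=float)
    p = mu.size
    chol = np.linalg.cholesky(np.asarray(sigma, dtype=float))  # needs sigma > 0
    z = rng.standard_normal((n, p))
    u = z / np.linalg.norm(z, axis=1, keepdims=True)  # uniform on the sphere
    r = radial_sampler(n, p, rng)
    return mu + r[:, None] * (u @ chol.T)

def normal_radial(n, p, rng):
    # Assumed radial law for the normal case: r^2 ~ chi-square with p d.f.
    return np.sqrt(rng.chisquare(p, size=n))

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
x = sample_elliptical(mu, sigma, normal_radial, 200_000, rng)
print(x.mean(axis=0))
print(np.cov(x, rowvar=False))
```

Other members of the family are obtained by swapping in a different `radial_sampler`; the sample mean and covariance printed here should be close to `mu` and `sigma` in the normal case.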

In one dimension, Pearson's family of distributions is defined by the following differential
equation satisfied by their densities (see Cramér [7]):

$$\frac{d \log f(x)}{dx} = \frac{x + a}{b_0 + b_1 x + b_2 x^2}. \tag{6}$$

Within this family, the first four moments determine the distribution. Several types of Pearson
distributions (depending on $a$, $b_0$, $b_1$, and $b_2$) have been identified. In addition to the normal,
the common types are the beta (Type II), gamma (Type III), and Student's $t$ (Type VII). The
elliptically contoured distributions given by (4) and (5) are multivariate versions of Types VII
and III, respectively. For example, when $\Sigma = I$ and $\mu = 0$, the density for the $p$-variate $t$
distribution with $\nu$ degrees of freedom satisfies the following differential equation:

$$\nabla \log f_\nu(x; 0, I) = -\frac{(p + \nu)\, x}{\nu + x'x}. \tag{7}$$

However, there is an important difference between (4) and (5). For (5), if $\mu = 0$ and $k > 0$, then
the density at the origin is 0, and the modal value, or peak, of the density occurs on the surface
of the ellipsoid $\{x : x'\Sigma^{-1}x = k\eta\}$. On the other hand, the density in (4) has its peak at the
origin, and it is unimodal. Several results that apply to the normal and (4) do not generalize to
(5); see Tong [34] for further details.

3  Tail  Probabilities 

If $X$ is a random variable with density $f$ and cumulative distribution function $F$, the tail
probability of $X$ refers to

$$\theta = 1 - F(a) = \int_a^\infty f(x)\, dx \tag{8}$$

for large values of $a$. In many statistical applications, such as hypothesis testing, the tail
probability of interest is around 0.05. For such cases, the computation of, say, p-values is usually
straightforward. In other applications, especially in engineering, much smaller probabilities are
of interest. For instance, in signal processing, the tail probability arises as the error rate of
a complex communications system (Scharf [30], Wessel, et al. [35]); and in reliability theory,
it arises as the failure rate of a system component (Lawless [22]). Often such systems have
redundancies built into them, so that their error or failure rates are very low. A simple model
of failure regards $X$ as an overall index of stress, and considers very large values of the failure
threshold, $a$.

In this formulation of the problem, two difficulties arise. First, the usual quadrature rules
and Monte Carlo methods for evaluating $\theta$ are not sufficiently accurate, so specialized methods
are needed for evaluating tail probabilities. We will turn to some of these methods below. Next,
the basis for the choice of probabilistic model (that is, $F$) is tenuous. This is because for a
complex system, the theoretical derivation of $F$ based on individual component characteristics is
intractable; also, data to estimate $\theta$ are sparse since the event of interest is rare. While information
about the central region (near the mean or median) of $F$ is usually available, the tail behavior
is usually unknown, so extrapolation is necessary. One way of addressing this problem is to
consider a wide range of plausible models for the tail behavior to derive a range of values for the
tail probability. For one example of just such an approach, see Lavine [21], who studied shuttle
O-ring data.

Multivariate versions of this problem arise in similar fashion: for instance, a system with
two components may fail when each component's stress exceeds its respective threshold, leading
to the failure probability $P(X_1 > a_1, X_2 > a_2)$. A number of new difficulties also arise. First,
multiple integration is still a hard problem in general, so with few exceptions multivariate tail
probabilities are not well studied. Also, a tail region can take on many shapes, for example,
$\{x : x_1 > a_1, x_2 > a_2\}$, $\{x : \alpha_1 x_1 + \alpha_2 x_2 > a\}$, or $\{x : x_1^2 + x_2^2 > a^2\}$. Below, we restrict
attention to convex regions that are far from the center of the distribution, eliminating the last
example from consideration.

There are two main sources of error in assessing tail probabilities. The first is numerical:
it is generally hard to evaluate a small quantity with small relative error. For a deterministic
method, if $\hat\theta$ is an approximation to $\theta$, the relative error is $(\hat\theta - \theta)/\theta$. For a Monte Carlo method,
the coefficient of variation (the ratio of the standard deviation to the mean of an estimator) is
a measure of the relative error. If the unbiased estimator $\hat\theta_n$ of $\theta$ is an average of $n$ independent
replicates, its squared coefficient of variation (cv$^2$) is

$$\mathrm{cv}^2(\hat\theta_n) = \frac{\mathrm{var}(\hat\theta_n)}{\theta^2} = \frac{1}{n}\left[\frac{E(\hat\theta_1^2)}{\theta^2} - 1\right]. \tag{9}$$


Below, we study the use of Monte Carlo with importance sampling to derive estimators for
which the cv$^2$ is small. If $B$ is a tail region, and $f$ is the density, importance sampling uses the
expression

$$\theta = \int_B f(x)\, dx = \int_B \frac{f(x)}{g(x)}\, g(x)\, dx = \int_B l(x)\, g(x)\, dx \tag{10}$$

for some “sampling density” $g$ to get an unbiased estimator which is the average of $n$ independent
replicates (over the set $B$) of the likelihood ratio $l(Y)$, where $Y$ has density $g$. We seek those $g$
for which the cv$^2$ is bounded as the tail probability tends to zero.

The  second  source  of  error  is  statistical:  the  uncertainty  in  the  choice  of  the  model  F  makes 
the  tail  probability  estimate  uncertain,  even  if  there  were  no  numerical  error.  There  are  several 
ways  to  address  this  issue.  One  is  to  introduce  a  plausible  family  of  models,  and  compute  a 
range  of  tail  probabilities  for  that  family.  Another  is  to  follow  the  approach  of  Johnstone  [19], 
for  the  Pearson  family.  He  estimates  the  parameters  of  the  family  from  available  data,  and 
then  provides  an  estimate  of  a  given  quantile  with  its  standard  error.  Yet  another  approach  is 
Bayesian:  first  model  the  uncertainty  in  F  by  putting  a  prior  on  it,  and  then  use  available  data 
to  compute  the  posterior  distribution  of  the  tail  probability. 

We start with the univariate case to motivate the multivariate case below. If $X$ has density
$f$, l'Hôpital's rule says that with suitable regularity, the asymptotic behavior of $P(X > a)/f(a)$
is the same as that of $r(a) = -f(a)/f'(a)$. The regularity conditions are that $f'(t) \ne 0$ for all
sufficiently large $t$, and that the ratio $r(a)$ have a limit as $a \to \infty$; these conditions are met in
many cases of interest. Writing

$$\int_a^\infty f(x)\, dx = r(a) f(a) \int_0^\infty \frac{f(a + x)}{r(a) f(a)}\, dx, \tag{11}$$

it is clear that (under the same regularity conditions) the last integral in (11) approaches 1 as
$a \to \infty$; thus, it is bounded away from 0, and estimating it with good relative accuracy can
be done using importance sampling. This heuristic has been extended by Gray and Wang [11],
where the generalized jackknife is used for evaluating univariate tail probabilities. The method
suggested below may be regarded as a Monte Carlo analog of that procedure.
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For  the  normal  distribution,  (11)  yields 


r  ^ r  a<K:,\a)d* = ^  r  '-* ''<*-*** . 

7a  a  Jo  </>(a)  a  Jo 


(12) 


which  s"ggests  the  estimator 


9  =  *Me-T2'\ 


(13) 


where  T  has  the  exponential  density,  ae  a*  for  t  >  0.  Now,  let  $(x)  and  <f>(x)  denote  the 
univariate  standard  normal  distribution  and  density  functions,  respectively,  and  let 


M(x)  =  =  ~  f°°  e~t2,2xe-xtdt 

<t>(x)  x  Jo 


(14) 


denote  Mills’  ratio.  Since  M  is  a  convex,  decreasing  function  (Iyengar  [13]),  the  following 
inequalities  are  easy  to  prove: 


1  +  x 


These  inequalities,  in  turn,  imply  that 


X  1 

<  M(x )  <  -  for  x  >  0. 


(15) 


{  }  M{aYaV2  a2 


(16) 


as  a  — ♦  oo,  so  that  the  cv2  tends  to  zero  as  a  increases.  This  estimator  results  from  the  sampling 
density  g{t)  —  ae-a(t-“)  for  t  >  a.  The  deterministic  analog  of  this  result  is  that 


0  < 


<t>(a)/a  -  $(-a)  _  1 

$(— a)  aM(o) 


-  1  < 


(17) 


so  that  the  relative  error  in  approximating  $(-a)  by  cp(a)/a  decreases  to  zero  as  a  increases. 
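
As an illustration of the estimator (13), the following minimal Python sketch draws $T$ from the exponential density $a e^{-at}$, averages $(\phi(a)/a)e^{-T^2/2}$, and compares the result with the exact tail $\Phi(-a)$ obtained from the complementary error function. The function names, the sample size, and the seed are illustrative assumptions and are not taken from the paper.

```python
import math
import numpy as np

def normal_tail_exact(a):
    """Exact upper tail P(X > a) of the standard normal via erfc."""
    return 0.5 * math.erfc(a / math.sqrt(2.0))

def mills_ratio(x):
    """Mills' ratio M(x) = Phi(-x) / phi(x)."""
    phi_x = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return normal_tail_exact(x) / phi_x

def normal_tail_is(a, n=100_000, seed=0):
    """Importance-sampling estimate of P(X > a) as in (13).

    Returns the estimate and the empirical squared coefficient of
    variation of the n-replicate average.
    """
    rng = np.random.default_rng(seed)
    t = rng.exponential(scale=1.0 / a, size=n)      # T has density a*exp(-a*t)
    phi_a = math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)
    reps = (phi_a / a) * np.exp(-0.5 * t * t)       # replicates of (13)
    est = reps.mean()
    cv2 = reps.var(ddof=1) / (n * est ** 2)
    return est, cv2

if __name__ == "__main__":
    for a in (2.0, 4.0, 6.0):
        est, cv2 = normal_tail_is(a)
        print(a, est, normal_tail_exact(a), cv2)
```

Because every replicate lies between 0 and $\phi(a)/a$, the replicates cannot produce the occasional huge weights that make many importance-sampling schemes unstable.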

The phenomenon observed in (16) is quite general: for a wide class of problems, the coefficient
of variation actually tends to zero, hence the relative accuracy improves as the threshold $a$
increases. In addition, this method is feasible since the calculation of $r(a)$ depends on the
differentiation of the density rather than its integration; since the behavior of the tail probability
is already captured by $r(a)f(a)$, the evaluation of the remaining integral by Monte Carlo provides
a correction term. In practice, either (11) or one of the following two expressions for $\theta$ is
useful:

$$\theta = r(a) f(a)\int_0^\infty \frac{f(a + r(a)x)}{f(a)}\,dx = r(a) f(a)\int_0^\infty \frac{a\,f(a + ax)}{r(a) f(a)}\,dx.$$


Two other examples illustrate this technique. The first involves the generalized inverse
Gaussian distribution, whose density is

$$f(t \mid \alpha, \beta, \lambda) = \frac{(\alpha/\beta)^{\lambda/2}}{2K_\lambda(\sqrt{\alpha\beta})}\,t^{\lambda-1}\exp\left[-\tfrac{1}{2}(\alpha t + \beta/t)\right], \quad \text{for } t > 0,$$

where $K_\lambda$ is the modified Bessel function of the third kind with index $\lambda$. The parameter space
is the union of the following three sets: $\{\alpha > 0, \beta > 0\}$, $\{\alpha = 0, \beta > 0, \lambda < 0\}$, and
$\{\alpha > 0, \beta = 0, \lambda > 0\}$. This family includes the gamma, the inverse Gaussian, the hyperbola distribution,
and their reciprocals, in the sense that if $X$ has density $f(t \mid \alpha, \beta, \lambda)$, then $X^{-1}$ has density
$f(t \mid \beta, \alpha, -\lambda)$. For the case $\alpha > 0$, $\beta > 0$, this method yields the estimator

$$\hat\theta = \frac{2}{\alpha}\,f(a \mid \alpha, \beta, \lambda)\,e^{\beta/(2a)}\left(1 + \frac{2T}{\alpha a}\right)^{\lambda-1}\exp\left[-\frac{1}{2}\,\frac{\alpha\beta}{(\alpha a + 2T)}\right],$$

for sufficiently large $a$, where $T$ has a standard exponential density. The second example is the
$t$ distribution with $k$ degrees of freedom, with density $f_k(x)$ proportional to $(1 + x^2/k)^{-(k+1)/2}$,
for which the estimator is

$$\hat\theta = \frac{a}{k}\,f_k(a)\left[\frac{(k + a^2)Y^2}{k + a^2Y^2}\right]^{(k+1)/2}, \qquad (21)$$

where $Y$ has the Pareto density $k/y^{k+1}$ for $y > 1$. In both cases, the cv² decreases to zero as
$a \to \infty$. Detailed proofs of these and related results are given in [17].
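
The $t$ estimator (21) can be sketched in a few lines of Python. The sketch below draws the Pareto variate by inversion ($Y = U^{-1/k}$ for uniform $U$) and checks the result against the closed-form Cauchy tail, which is the case $k = 1$. The function names, sample size, and seed are assumptions made for illustration only.

```python
import math
import numpy as np

def t_density(x, k):
    """Density of Student's t with k degrees of freedom."""
    c = math.gamma((k + 1) / 2.0) / (math.sqrt(k * math.pi) * math.gamma(k / 2.0))
    return c * (1.0 + x * x / k) ** (-(k + 1) / 2.0)

def t_tail_is(a, k, n=100_000, seed=0):
    """Importance-sampling estimate of P(X > a) for X ~ t_k, following (21)."""
    rng = np.random.default_rng(seed)
    u = 1.0 - rng.random(n)                 # uniform on (0, 1], avoids division by zero
    y = u ** (-1.0 / k)                     # Pareto density k / y^(k+1) on y >= 1
    ratio = (k + a * a) * y * y / (k + a * a * y * y)
    reps = (a / k) * t_density(a, k) * ratio ** ((k + 1) / 2.0)
    return float(reps.mean())

def cauchy_tail_exact(a):
    """Exact P(X > a) for the standard Cauchy (t with one degree of freedom)."""
    return 0.5 - math.atan(a) / math.pi

if __name__ == "__main__":
    print(t_tail_is(50.0, 1), cauchy_tail_exact(50.0))
```

The bracketed ratio in (21) lies between 1 and $(k + a^2)/a^2$, so for large $a$ the replicates are nearly constant, which is the mechanism behind the vanishing cv².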

We now turn to the multivariate case. In 1962, Slepian [32] proved the following inequality.
Let $X \sim N_p(0, \Sigma = (\sigma_{ij}))$ and $Y \sim N_p(0, T = (\tau_{ij}))$ with $\sigma_{ij} \ge \tau_{ij}$ and $\sigma_{ii} = \tau_{ii}$; then for any
vector $a$, $P(X \ge a) \ge P(Y \ge a)$, where $x \ge a$ means that $x_i \ge a_i$ for all $i$. Slepian derived
this result using Plackett's identity (see Section 4 below) in a study of one-sided boundary
crossing problems for Gaussian processes. Since Slepian proved his inequality, his result has
been generalized in a number of ways. For instance, the inequality holds for all elliptically
contoured distributions: see Das Gupta, et al. [8] and Tong [34] for such results.

When $\sigma_{ij} \ge 0$ for all $i$ and $j$, the inequality $P_\Sigma(X \ge a) \ge P_I(X \ge a)$ yields a lower bound
which can be easily computed for the normal, since then it is a product of univariate normal
probabilities. However, this lower bound often gives a poor approximation (see Iyengar [14]), so
that Slepian's inequality is more useful for theoretical investigations. Thus, in this section, we
describe alternative methods that provide good approximations.

Suppose that $X$ is a $p$-variate vector which has an elliptically contoured distribution with
density $|\Sigma|^{-1/2} g(x'\Sigma^{-1}x)$; further, let the term “tail region” refer to a closed convex region $B$
that is far from 0 (of course, $B$ should have non-empty interior, else the probability will be
zero). If $\Sigma = L'L$ is the Cholesky decomposition of $\Sigma$, then $Z = L^{-1}X$ has the density
$f(z) = f(z; 0, I) = g(z'z)$, and $P(X \in B) = P(Z \in A = L^{-1}B)$. Since $A$ is closed and
convex, it contains a unique point, $a$, that is closest to the origin: $|a| \le |z|$ for $z \in A$, and $A$ is
contained in the half plane $\{z : z'a \ge a'a\}$. Since $Z$ has a spherically symmetric distribution,
$A$ can be rotated so that $a = r e_1$, where $e_1$ is the unit vector in the $z_1$ direction, and $r = |a|$.
Note that $r = r(A)$ depends upon the set $A$; for notational convenience, this dependence will be
suppressed. Next, if $\beta = La$, then $\beta$ minimizes the Mahalanobis distance, $(x'\Sigma^{-1}x)^{1/2}$, of points
in $B$ to the origin; also, $B$ is contained in the half plane $\{x : x'\Sigma^{-1}\beta \ge \beta'\Sigma^{-1}\beta\}$. Of course,
the problem of finding $\beta$ is a quadratic programming problem which can be solved using known
techniques. For any set $A$, matrix $D$, and vector $c$, let $DA + c$ denote the set $\{Dx + c : x \in A\}$.

To estimate $\theta = P(Z \in A)$, ordinary Monte Carlo averages $n$ independent replicates of
$I(Z \in A)$, where $I$ is an indicator function. This estimator's variance is $(\theta - \theta^2)/n$. An
alternative approach is to use $f(z - a)$ as a sampling function (Wessel, et al. [35] refer to this
as improved importance sampling). The expression

$$\theta = \int_A \frac{f(z)}{f(z-a)}\,f(z-a)\,dz = \int_{A-a} \frac{f(z+a)}{f(z)}\,f(z)\,dz \qquad (22)$$

suggests the unbiased estimator

$$\hat\theta = \frac{f(Z+a)}{f(Z)}\,I(Z \in A - a). \qquad (23)$$

If $g$ is a decreasing function (that is, $f$ is unimodal, as is the case with the $p$-variate normal
or $t$, but not the family given in (5)), then $f(z) \le f(z - a)$ for $z \in A$, and

$$E(\hat\theta^2) = \int_A \frac{f(z)}{f(z-a)}\,f(z)\,dz \le \theta, \qquad (24)$$

so that $\hat\theta$ has a smaller variance (and smaller cv²) than ordinary Monte Carlo. However, it can
be shown that for several cases (the normal and the $t$), the cv² tends to infinity as $a \to \infty$ (see
[17]). Thus, we turn to multivariate analogs of the method described in (12) above.
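
To see the contrast between these sampling schemes numerically, the sketch below evaluates the single-replicate cv² in closed form for the one-dimensional normal tail $A = [a, \infty)$. This one-dimensional specialization is an assumption made here for illustration and is not worked out in the text: ordinary Monte Carlo has cv² equal to $(1-\theta)/\theta$, the shifted sampling density $f(z-a)$ gives $E(\hat\theta^2) = e^{a^2}\Phi(-2a)$ from (24), and the exponential estimator (13) has cv² equal to $M(a/\sqrt{2})/(\sqrt{2}\,a\,M(a)^2) - 1$. All function names are hypothetical.

```python
import math

def upper_tail(x):
    """P(X > x) for the standard normal."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def mills(x):
    """Mills' ratio Phi(-x) / phi(x)."""
    return upper_tail(x) * math.sqrt(2.0 * math.pi) * math.exp(0.5 * x * x)

def cv2_ordinary(a):
    """cv^2 of one replicate of the indicator I(Z >= a)."""
    theta = upper_tail(a)
    return (1.0 - theta) / theta

def cv2_shifted(a):
    """cv^2 of one replicate of (23) in one dimension (sampling density f(z - a))."""
    return math.exp(a * a) * upper_tail(2.0 * a) / upper_tail(a) ** 2 - 1.0

def cv2_exponential(a):
    """cv^2 of one replicate of the exponential estimator (13)."""
    return mills(a / math.sqrt(2.0)) / (math.sqrt(2.0) * a * mills(a) ** 2) - 1.0

if __name__ == "__main__":
    for a in (2.0, 4.0, 6.0):
        print(a, cv2_ordinary(a), cv2_shifted(a), cv2_exponential(a))
```

In this special case the shifted scheme improves greatly on ordinary Monte Carlo, yet its cv² still grows with $a$, while the cv² of the exponential scheme shrinks.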

Although a direct generalization of (12) is not available, the analog is to write $A_0 = A - a$,

$$\theta = \int_A f(z)\,dz = f(a)\int_{A_0} \frac{f(z+a)}{f(a)}\,dz, \qquad (25)$$

and to manipulate the ratio $f(z + a)/f(a)$ to derive an estimator that has bounded cv² as the
region $A$ moves outward to infinity. Just as in the one-dimensional case, there is no generic
method that will work for all $g$; and unlike the one-dimensional case, the shape of $A$ (or
equivalently the shape of $B$ and the dependence among the random variables as given by $\Sigma$) plays
an important role in the choice of sampling function. We now sketch the details for the normal
and $t$ distributions.


For the normal with density $\phi_p(z) = \phi_p(z; 0, I)$, (25) becomes

$$\theta = \phi_p(a)\int_{A_0} \frac{\phi_p(z+a)}{\phi_p(a)}\,dz = \phi(|a|)\int_{A_0} e^{-|a|z_1 - z_1^2/2}\,\phi_{p-1}(u)\,du\,dz_1, \qquad (26)$$

where $u = (z_2, \ldots, z_p)$. Next, for the $t$ density $f_p(z) = f_{p,\nu}(z; 0, I)$, a slight modification of (12)
is needed. Let $A_1 = A/|a|$ to get

$$\theta = f_p(a)\,|a|^p \int_{A_1} \left(\frac{\nu + |a|^2}{\nu + |a|^2|z|^2}\right)^{(p+\nu)/2} dz. \qquad (27)$$

Now using the sampling density which is proportional to $|z|^{-(p+\nu)}$ on $A_1$, we get

$$\theta = f_p(a)\,|a|^p \int_{A_1} \left[\frac{(\nu + |a|^2)\,|z|^2}{\nu + |a|^2|z|^2}\right]^{(p+\nu)/2} \frac{dz}{|z|^{p+\nu}}. \qquad (28)$$


Such expressions provide guidelines on the nature of the sampling function to use for
importance sampling. The specific choice depends, as mentioned before, on the nature of $A$,
specifically, on the shape of $A_0$ near the origin (or $A_1$ near the point $e_1$). In particular, let
$B = \{x : x_1 \ge b_1, x_2 \ge b_2\}$, where the $b_i$ are positive; without loss of generality, suppose that
$b_1 \le b_2$. When the correlation between $X_1$ and $X_2$ is $\rho$, the point, $\beta$, that is closest to the origin
(using Mahalanobis distance) is

$$\beta = \begin{cases} (b_1, b_2) & \text{if } \rho < b_1/b_2 \\ (\rho b_2, b_2) & \text{if } \rho > b_1/b_2. \end{cases} \qquad (29)$$

Transforming to the independent case and rotating so that the nearest point, $a$, is in the $e_1$
direction gives

$$a = \begin{cases} ([b'R^{-1}b]^{1/2}, 0) & \text{if } \rho < b_1/b_2 \\ (b_2, 0) & \text{if } \rho > b_1/b_2. \end{cases} \qquad (30)$$

The region $A$ is given in Figure 1 for $\rho < b_1/b_2$, and in Figure 2 for $\rho > b_1/b_2$. Since the nature of
$A_0 = A - a$ at the origin is determined by the difference $\rho - b_1/b_2$, the ratio $b_1/b_2$ will be
preserved in the calculations above: in effect, the region $B$ will be moved outward towards
infinity in the direction of the vector $b = (b_1, b_2)$.
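
The case split in (29) and (30) is easy to check numerically. The following Python sketch computes $\beta$ and the radius $r = |a|$ for a bivariate quadrant and compares the radius with a brute-force scan of the quadrant's two edges, where the minimum must lie because the origin is outside $B$. The function names and the grid resolution are illustrative assumptions; at $\rho = b_1/b_2$ the two branches coincide, so the sketch assigns that boundary case to the first branch.

```python
import math

def mahalanobis_sq(x1, x2, rho):
    """x' R^{-1} x for a 2 x 2 correlation matrix R with off-diagonal rho."""
    return (x1 * x1 - 2.0 * rho * x1 * x2 + x2 * x2) / (1.0 - rho * rho)

def nearest_point(b1, b2, rho):
    """Point beta of B = {x1 >= b1, x2 >= b2} closest to the origin, as in (29),
    together with its Mahalanobis distance r (assumes 0 < b1 <= b2, |rho| < 1)."""
    if rho <= b1 / b2:
        beta = (b1, b2)             # the vertex of the quadrant
    else:
        beta = (rho * b2, b2)       # a point on the edge x2 = b2
    return beta, math.sqrt(mahalanobis_sq(beta[0], beta[1], rho))

def brute_force_radius(b1, b2, rho, span=5.0, steps=20001):
    """Scan both edges of the quadrant on a grid and return the smallest distance."""
    best = float("inf")
    for i in range(steps):
        d = span * i / (steps - 1)
        best = min(best,
                   mahalanobis_sq(b1 + d, b2, rho),   # edge x2 = b2
                   mahalanobis_sq(b1, b2 + d, rho))   # edge x1 = b1
    return math.sqrt(best)

if __name__ == "__main__":
    print(nearest_point(1.0, 2.0, 0.95), brute_force_radius(1.0, 2.0, 0.95))
```

For $(b_1, b_2) = (1, 2)$ and $\rho = 0.95$ the nearest point moves off the vertex to $(1.9, 2)$, and its Mahalanobis distance equals $b_2 = 2$, in agreement with the second line of (30).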


{FIGURES  HERE}

We will now provide some of the details for the normal distribution; for a fuller account,
see [17]. When the correlation coefficient $\rho$ is not large ($\rho < b_1/b_2$ when $b_1 < b_2$), the bivariate
sampling function consists of a product of two exponential densities, and when $\rho$ is large, the
sampling function consists of the product of an exponential and a normal. This is intuitively
plausible, since for small $\rho$, the bivariate normal density is not far from the independent case,
while for large $\rho$, it is not far from the singular case, for which the exponential given in (13)
yields accurate estimates. Transforming back to $X$ (with $\rho_{12} = \rho$), the estimators are given by
the following. For $\rho < b_1/b_2$,

$$\frac{\phi_2(b; R)\,(1-\rho^2)^2}{(b_1 - \rho b_2)(b_2 - \rho b_1)}\,e^{-T'R^{-1}T/2}, \qquad (31)$$

where $T = (T_1, T_2)$ has independent exponentially distributed components with mean vector
$((1-\rho^2)/(b_1 - \rho b_2),\ (1-\rho^2)/(b_2 - \rho b_1))$. And for $\rho > b_1/b_2$ it is

$$\frac{\phi(|a|)}{|a|}\,e^{-T^2/2}\,I[(T, U) \in A_0], \qquad (32)$$

where $T$ and $U$ are independent with densities $|a|e^{-|a|t}$ and $\phi(u)$, respectively, and $A_0 =
A - (b_2, 0)$ is the translate of the set given in Figure 2. For both of these cases, it can be shown
that the cv² for the estimators given above all tend to zero as $a \to \infty$, that is, as the tail
probability diminishes. The proof for the normal case is given in [17]. We omit the proof for
the $t$ distribution. Instead, we turn to the key quantity that is used in the proofs, Mills' ratio.
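
The two bivariate normal estimators (31) and (32) can be sketched as follows. This is a minimal Python illustration under stated assumptions: unit variances, $0 < b_1 \le b_2$, $|\rho| < 1$, and, for the second case, the coordinates $z_1 = x_2$ and $z_2 = (\rho x_2 - x_1)/(1-\rho^2)^{1/2}$, so that $A_0$ is the set $\{t \ge 0,\ u \le (\rho(t + b_2) - b_1)/(1-\rho^2)^{1/2}\}$. The function names, sample size, and seed are hypothetical.

```python
import math
import numpy as np

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi_bar(x):
    """Standard normal upper tail P(Z > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def quadrant_prob(b1, b2, rho, n=200_000, seed=0):
    """Estimate P(X1 >= b1, X2 >= b2) for a standardized bivariate normal."""
    rng = np.random.default_rng(seed)
    s2 = 1.0 - rho * rho
    if rho < b1 / b2:
        # Estimator (31): product of two exponential sampling densities.
        lam1 = (b1 - rho * b2) / s2
        lam2 = (b2 - rho * b1) / s2
        t1 = rng.exponential(1.0 / lam1, n)
        t2 = rng.exponential(1.0 / lam2, n)
        quad = (t1 * t1 - 2.0 * rho * t1 * t2 + t2 * t2) / s2       # T' R^{-1} T
        q_b = (b1 * b1 - 2.0 * rho * b1 * b2 + b2 * b2) / s2        # b' R^{-1} b
        dens_b = math.exp(-0.5 * q_b) / (2.0 * math.pi * math.sqrt(s2))
        return float(dens_b / (lam1 * lam2) * np.mean(np.exp(-0.5 * quad)))
    # Estimator (32): exponential in the e1 direction, normal in the other.
    t = rng.exponential(1.0 / b2, n)
    u = rng.standard_normal(n)
    inside = u <= (rho * (t + b2) - b1) / math.sqrt(s2)             # (T, U) in A_0
    return float(phi(b2) / b2 * np.mean(np.exp(-0.5 * t * t) * inside))

if __name__ == "__main__":
    print(quadrant_prob(2.0, 3.0, 0.0), Phi_bar(2.0) * Phi_bar(3.0))
    print(quadrant_prob(1.0, 2.0, 0.95), Phi_bar(2.0))
```

With $\rho = 0$ the first branch can be checked against the product $\Phi(-b_1)\Phi(-b_2)$, and with $\rho$ close to 1 the second branch should fall just below $\Phi(-b_2)$.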

Several definitions of the multivariate normal Mills' ratio are available. The first definition
is due to Savage [29] for the case of orthants:

$$M_1(B; R) = \frac{P(X \in B)}{\phi_p(b; R)}, \qquad (33)$$

for $X \sim N_p(0, R)$. Another definition is obtained by first transforming to the spherically symmetric
case with $Z$, $A$, and $a$ replacing $X$, $B$, and $\beta$ respectively. For $r = |a|$ let

$$M_2(A; I) = \frac{P(Z \in A)}{\phi(r)}. \qquad (34)$$

This definition applies to convex regions $A$, not just orthants. However, the two definitions do
not coincide when $B$ is an orthant. For $R \neq I$,

$$M_2(A; I) = \frac{P(Z \in A)}{(2\pi)^{(p-1)/2}\,\phi_p(a; I)} = \left(\frac{2\pi}{|2\pi R|}\right)^{1/2}\frac{P(X \in B)}{\phi_p(\beta; R)}, \qquad (35)$$

so that the two definitions differ in two respects. First, in place of $\beta$, the first definition uses the vertex $b$;
for example, when $(b_1, b_2) = (1, 2)$ and $\rho = 0.95$, $(\beta_1, \beta_2) = (1.9, 2)$. This is an important
difference, because when the correlation is high, importance sampling centered at $b$ can be much
worse than that centered even at the origin (see [17]). Second, the new definition has the factor
$(2\pi/|2\pi R|)^{1/2}$; this is not an important difference, but it does mean that proper comparisons
of the two must first adjust for this factor.

For the multivariate normal, the following inequalities for $M_2$ generalize (15):

$$M_2(A; I) \le \frac{1}{r}\,P[(T, U) \in A_0], \qquad (36)$$

$$\begin{aligned}
M_2(A; I) &\ge \frac{1}{r}\,P[(T, U) \in A_0] - \frac{1}{r}\int_{A_0} \frac{t^2}{2}\,r e^{-rt}\,\phi(u)\,du\,dt \\
&\ge \frac{1}{r}\,P[(T, U) \in A_0] - \frac{1}{r}\int_{t \ge 0} \frac{t^2}{2}\,r e^{-rt}\,\phi(u)\,du\,dt \qquad (37) \\
&= \frac{1}{r}\left[P[(T, U) \in A_0] - \frac{1}{r^2}\right],
\end{aligned}$$

where $(T, U)$ is as in (32). When $A = L^{-1}B$, where $B$ is a quadrant, explicit expressions for
the bounds in (36) and the first line of (37) are available. Such inequalities are not available for
$M_1$. These inequalities are used in [17] to prove that the estimators in (31) and (32) have cv²
tending to zero as $a \to \infty$.

Mills' ratio for elliptically contoured densities is defined analogously: the numerator is
$P(X \in B)$, while the denominator is either $\phi_p(b; R)$ or $\phi_p(\beta; R)$ for $M_1$ and $M_2$, respectively.
In [9], Fang and Xu give a detailed account of $M_1$. They show that if $X$ has an elliptically
contoured distribution given by (2), where $g$ is a non-increasing function, then the function
$-P(X \in B)$ is a Schur convex function; they use this fact, along with standard majorization
results, to provide inequalities for $M_1$. A detailed study of the analog of $M_2$ for other elliptically
contoured distributions has not yet been done.

4  Computation  of  Moments 

In his paper, Brillinger [4] noted that a moment generalizes the notion of a probability, since
the latter is the first moment of an indicator function, which is a building block of integrable
functions. Here, we use the term moment to denote the expected value, when it exists, of some
function of a random vector, that is, $E[g(X)] = E[g(X_1, \ldots, X_p)]$. Conventionally, (product)
moments are defined as $E[\prod_{i=1}^{p} X_i^{k_i}]$, where the $k_i$ are non-negative integers. In this section, we
discuss three methods for computing moments for elliptically contoured distributions. The first
uses the characteristic function when it is available, the second uses the stochastic representation
(1) when the moments of $r$ are available, and the third uses several partial differential equations
that are given below. Throughout, let $X = \mu + r\Sigma^{1/2}U_p$, as in (1).

The first two methods, which are due to Li [23], are of course equivalent; computational
convenience dictates the choice of method. Let the $k$-th moment (when it exists) of the vector
$X$ be given by the matrix $\Gamma_k(X)$, where

$$\Gamma_k(X) = (\gamma^{(k)}_{ij}) = \begin{cases} E[X \otimes X' \otimes X \cdots \otimes X'] & \text{if } k \text{ is even} \\ E[X \otimes X' \otimes X \cdots \otimes X' \otimes X] & \text{if } k \text{ is odd,} \end{cases} \qquad (38)$$

where $\otimes$ denotes the Kronecker product, which has $k$ terms in (38). This definition reduces to the
usual mean vector and covariance matrix when $k = 1$ and 2, respectively; $\Gamma_1(X) = \mu$ whenever
the first moment exists. For $k \ge 3$, the following recipe tells us where to find $E[\prod_{i=1}^{p} X_i^{k_i}]$ (with
$\sum_{i=1}^{p} k_i = k$) in $\Gamma_k(X)$: if the terms in the product are strung out thus, $\gamma^{(k)}_{ij} = E(X_{i_1}X_{i_2} \cdots X_{i_k})$,
then

$$i = 1 + \sum_{l=1}^{[(k+1)/2]} (i_{2l-1} - 1)\,p^{[(k+1)/2]-l}, \qquad j = 1 + \sum_{l=1}^{[k/2]} (i_{2l} - 1)\,p^{[k/2]-l},$$

where $[a]$ is the greatest integer in $a$.
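
The index recipe can be checked mechanically. The Python sketch below builds the alternating Kronecker product $x \otimes x' \otimes x \cdots$ for a fixed vector $x$ (the quantity whose expectation is $\Gamma_k$) with NumPy's `kron`, and maps a string of component indices $(i_1, \ldots, i_k)$ to the row and column where the product $x_{i_1} \cdots x_{i_k}$ is stored. The function names are hypothetical, and a deterministic vector stands in for the random vector purely for illustration.

```python
import numpy as np

def gamma_index(idx, p):
    """Map 1-based component indices (i_1, ..., i_k) to the 1-based
    (row, column) position in the p^[(k+1)/2] x p^[k/2] matrix Gamma_k."""
    odd = idx[0::2]     # i_1, i_3, ... come from the column-vector factors X
    even = idx[1::2]    # i_2, i_4, ... come from the row-vector factors X'
    row = 1 + sum((i - 1) * p ** (len(odd) - l) for l, i in enumerate(odd, start=1))
    col = 1 + sum((i - 1) * p ** (len(even) - l) for l, i in enumerate(even, start=1))
    return row, col

def kron_chain(x, k):
    """Alternating Kronecker product x (x) x' (x) x ... with k factors."""
    out = np.ones((1, 1))
    for m in range(k):
        factor = x.reshape(-1, 1) if m % 2 == 0 else x.reshape(1, -1)
        out = np.kron(out, factor)
    return out

if __name__ == "__main__":
    x = np.array([2.0, 3.0, 5.0])
    g3 = kron_chain(x, 3)                      # shape (9, 3)
    row, col = gamma_index((2, 3, 1), 3)
    print(row, col, g3[row - 1, col - 1])      # entry holding x_2 * x_3 * x_1
```

Odd-numbered positions in the product determine the row and even-numbered positions determine the column, which is why the two sums run over $[(k+1)/2]$ and $[k/2]$ terms respectively.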

Using this notation, the matrices $\Gamma_k(X)$ can be expressed in two ways. First, if the
characteristic function is known, repeated differentiation of it gives the following expressions for $k = 2$
and 3:

$$\Gamma_2(X) = \mu\mu' - 2\psi'(0)\Sigma,$$

$$\Gamma_3(X) = \mu \otimes \mu' \otimes \mu - 2\psi'(0)[\mu \otimes \Sigma + \Sigma \otimes \mu + \mathrm{vec}(\Sigma)\mu'], \qquad (41)$$

where $\mathrm{vec}(\Sigma) = (\sigma_{11}, \sigma_{21}, \ldots, \sigma_{p1}, \ldots, \sigma_{1p}, \ldots, \sigma_{pp})'$ strings out the columns of $\Sigma$ into one long
vector.

This formulation is useful for the family (5), for the characteristic function is given by

$$\psi(u) = e^{-\eta u/4}\sum_{m=0}^{k}\binom{k}{m}\frac{\Gamma(p/2)}{\Gamma(m + p/2)}\left(-\frac{\eta u}{4}\right)^{m}, \qquad (42)$$

so that $-2\psi'(0) = \eta(2k + p)/(2p)$. A proof of this result is given in Iyengar and Tong [15]. When
the characteristic function is not available, but the moments of $r$ are available, the representation
(for $\mu = 0$ and $\Sigma = I$) $X = rU_p$ implies that $\Gamma_k(X) = E(r^k)\,\Gamma_k(U_p)$. Since $\Gamma_k(U_p)$ can be derived
from the known properties of the normal distribution, $-2\psi'(0)$ is replaced by $E(r^2)/p$ in (41).
For instance, for the multivariate $t$, the characteristic function is intractable, but the density of
$r$ is proportional to

$$r^{p-1}(1 + r^2/\nu)^{-(p+\nu)/2}, \quad r > 0,$$

which yields the finite moments upon integration. Expressions for the fourth moment $\Gamma_4$ that
involve $\psi''(0)$ or $E(r^4)$ are given in [23]; even higher order moments can be computed along
the lines outlined there. Since quadratic forms in elliptically contoured distributions arise in
standard testing procedures (see Anderson and Fang [2]), Li also provides expressions for their
moments.

In a related study, Xu and Fang [36] say that an $n \times p$ random matrix $X$ has a matrix elliptically contoured
density if $TX$ has the same distribution as $X$ for every $n \times n$ orthogonal matrix $T$. The density
then has the form $c_{n,p} f(x'x)$; if $Y = X\Sigma^{1/2}$ for a $p \times p$ covariance matrix $\Sigma$, the density of $Y$ is
given by

$$c_{n,p}\,|\Sigma|^{-n/2} f(\Sigma^{-1/2}\,y'y\,\Sigma^{-1/2}). \qquad (44)$$

In their paper, Xu and Fang give the expected values of zonal polynomials and other symmetric
functions of $W = Y'Y$. The expressions are rather involved, so we omit them.

The third method of computing moments has a longer history. In 1958, Price [27] proved
the following result. Let $N_p(\mu, \Sigma)$ denote a $p$-variate normal with mean $\mu$ and covariance matrix
$\Sigma = (\sigma_{ij})$. Suppose that $X = (X_1, \ldots, X_p)$ has a $N_p(\mu, \Sigma)$ distribution (written $X \sim N_p(\mu, \Sigma)$),
and let $g_1(X_1), \ldots, g_p(X_p)$ be differentiable functions of the components of $X$, each admitting a
Laplace transform; then

$$\frac{\partial}{\partial\sigma_{ij}}\,E\Big[\prod_{k=1}^{p} g_k(X_k)\Big] = E\Big[\frac{\partial^2}{\partial x_i\,\partial x_j}\prod_{k=1}^{p} g_k(X_k)\Big] \quad \text{for } i \neq j. \qquad (45)$$

Conversely, if this identity holds for arbitrary $g_1, \ldots, g_p$ (with both expectations above defined)
then $X$ has a multivariate normal distribution. Price and others used this theorem to facilitate
studies in signal processing. In particular, suppose that a zero-memory non-linear input-output
device with Gaussian input $X_i$ yields output $g_i(X_i)$. The $p$-th order correlation coefficient
of the outputs is a quantity of interest which requires the computation of the expectation of
$\prod_{k=1}^{p} g_k(X_k)$. The differential equation of Price's theorem provides a useful computational tool for
such calculations. Consider the following trivial example: if $h(\rho) = E(X_1X_2)$, where $\rho_{12} = \rho$
is the correlation between the standardized variates $X_1$ and $X_2$, then $h'(\rho) = 1$, and, since
$h(0) = 0$, $h(\rho) = \rho$ follows.
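
A less trivial instance of (45) can be verified numerically. The Python sketch below takes $p = 2$ with standardized variates and $g_1(x) = g_2(x) = x^3$, evaluates both sides of Price's identity by tensor-product Gauss-Hermite quadrature (NumPy's `hermegauss`, which is exact for low-degree polynomials), and approximates the derivative in $\rho$ by a central difference. The choice of test function, the step size, and the function names are assumptions made for illustration.

```python
import numpy as np

def bivariate_expectation(g, rho, deg=30):
    """E[g(X1, X2)] for a standardized bivariate normal with correlation rho,
    computed by tensor-product Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(deg)
    weights = weights / np.sqrt(2.0 * np.pi)        # rescale to N(0, 1) probabilities
    z1, z2 = np.meshgrid(nodes, nodes, indexing="ij")
    w = np.outer(weights, weights)
    x1 = z1
    x2 = rho * z1 + np.sqrt(1.0 - rho * rho) * z2   # gives Corr(X1, X2) = rho
    return float(np.sum(w * g(x1, x2)))

def price_check(rho, h=1e-4):
    """Return (d/d rho E[X1^3 X2^3], E[d^2/dx1 dx2 of x1^3 x2^3])."""
    g = lambda x1, x2: x1 ** 3 * x2 ** 3
    d2g = lambda x1, x2: 9.0 * x1 ** 2 * x2 ** 2    # mixed second derivative of g
    lhs = (bivariate_expectation(g, rho + h) - bivariate_expectation(g, rho - h)) / (2.0 * h)
    rhs = bivariate_expectation(d2g, rho)
    return lhs, rhs

if __name__ == "__main__":
    print(price_check(0.5))
```

Here $E(X_1^3X_2^3) = 9\rho + 6\rho^3$, so both sides equal $9 + 18\rho^2$; solving the resulting differential equation from $h(0) = 0$ recovers the moment without any bivariate integration.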

Although Price's theorem is an elegant result, it has several limitations. In fact, Pawula
[25] (see also Papoulis [24]) noted that when $p = 2$, and the right hand side of (45) can be
evaluated explicitly, there is a single differential equation to solve. But for larger $p$, there are
$p(p-1)/2$ differential equations to solve simultaneously. Furthermore, Price's result applies only
to a product of functions of individual components. Pawula used a result of Plackett [26] to
overcome these limitations. In 1954, Plackett proved the following identity while investigating
a reduction formula for multivariate normal probabilities: if the density of a $N_p(\mu, \Sigma)$ variate is
$\phi_p(x - \mu; \Sigma)$, then

$$\frac{\partial}{\partial\sigma_{ij}}\,\phi_p(x - \mu; \Sigma) = \frac{\partial^2}{\partial x_i\,\partial x_j}\,\phi_p(x - \mu; \Sigma), \quad \text{for } i \neq j. \qquad (46)$$

For the case $i = j$, we have the diffusion equation

$$\frac{\partial}{\partial\sigma_{ii}}\,\phi_p(x - \mu; \Sigma) = \frac{1}{2}\,\frac{\partial^2}{\partial x_i^2}\,\phi_p(x - \mu; \Sigma). \qquad (47)$$

Pawula used Plackett's identity to extend Price's theorem thus: if $g(x_1, \ldots, x_p)$ is sufficiently
smooth and vanishes rapidly near infinity, then

$$\frac{\partial}{\partial\sigma_{ij}}\,E[g(X_1, \ldots, X_p)] = E\Big[\frac{\partial^2}{\partial x_i\,\partial x_j}\,g(X_1, \ldots, X_p)\Big] \quad \text{for } i \neq j. \qquad (48)$$

This extension allowed the study of more general functions, such as the “linear rectifier
correlator,” $g(x_1, x_2) = |x_1 + x_2| - |x_1 - x_2|$.

Pawula then used the following method, also due to Plackett, to reduce the number of
differential equations to solve from $p(p-1)/2$ to one. For a given $\Sigma$ define a line between it and
the identity matrix $I$, $\Sigma_t = (1 - t)I + t\Sigma$ for $0 \le t \le 1$. The chain rule then gives

$$\frac{\partial}{\partial t}\,\phi_p(x - \mu; \Sigma_t) = \sum_{i<j}\sigma_{ij}\,\frac{\partial^2}{\partial x_i\,\partial x_j}\,\phi_p(x - \mu; \Sigma_t), \qquad (49)$$

so that

$$\frac{d}{dt}\,E_t[g(X_1, \ldots, X_p)] = E_t\Big[\sum_{i<j}\sigma_{ij}\,\frac{\partial^2}{\partial x_i\,\partial x_j}\,g(X_1, \ldots, X_p)\Big], \qquad (50)$$

where $E_t$ denotes the expectation with respect to $N(\mu, \Sigma_t)$. When the right hand side of (50) can
be evaluated, a single ordinary differential equation results. By solving it, Pawula showed how
to compute the moments of various functions of $X$, such as products of Hermite polynomials or
error functions. In some cases, higher order derivatives with respect to $t$ are needed: they are
just iterates of the partial differential operator on the right of (50).

The search for bounds for certain probabilities and expectations has recently led to several
generalizations of Plackett's identity to elliptically contoured distributions. The first is a result
of Joag-dev, et al. [18] which only requires that $g$ in (2) be differentiable:

$$\frac{\partial}{\partial\sigma_{ij}}\,f(x; \mu, \Sigma) = -\frac{\partial}{\partial x_j}\Big(\sum_{k=1}^{p}\sigma^{ik}x_k\Big) f(x; \mu, \Sigma), \qquad (51)$$

where $\sigma^{ik}$ is the $i,k$ element of $\Sigma^{-1}$. Another is due to Iyengar ([12], see also Iyengar and Tong
[15]), who proved the following identity for $f_{p,k}$:

$$\frac{\partial}{\partial\sigma_{ij}}\,f_{p,k}(x; \Sigma, \eta) = \frac{\eta}{2}\sum_{m=0}^{k}\frac{k!}{m!}\,\frac{\Gamma(\frac{p}{2} + m)}{\Gamma(\frac{p}{2} + k)}\,\frac{\partial^2}{\partial x_i\,\partial x_j}\,f_{p,m}(x; \Sigma, \eta). \qquad (52)$$

This specializes to Plackett's identity when $k = 0$. Finally, Gordon [10] proved a definitive
version of Plackett's identity for elliptically contoured densities (the proof of which he traced
back to [8,18]). He showed that the following two statements about functions $g$ and $h$, each
mapping $\mathbb{R}^+$ into itself and vanishing at $\infty$, are equivalent:

$$h(t) = \frac{1}{2}\int_t^\infty g(r)\,dr \qquad (53)$$

and

$$\frac{\partial}{\partial\sigma_{ij}}\,g_\Sigma(x) = \frac{\partial^2}{\partial x_i\,\partial x_j}\,h_\Sigma(x), \qquad (54)$$

where $g_\Sigma(x) = |\Sigma|^{-1/2}\,g(x'\Sigma^{-1}x)$, and similarly for $h$. When $g$ is an exponential or an
appropriately chosen gamma density the identities of Plackett and Iyengar, (46) and (52), respectively,
follow. Next, for the $p$-variate $t$ with $\nu$ degrees of freedom, we have

$$h(t) = \frac{\Gamma((p + \nu)/2)}{(\pi\nu)^{p/2}\,\Gamma(\nu/2)}\,\frac{\nu}{(p + \nu - 2)}\left(1 + \frac{t}{\nu}\right)^{-(p+\nu-2)/2}. \qquad (55)$$


These extensions of Plackett's identity have been used principally for theoretical
investigations, in particular, for studying the nature of the dependence among the components of $X$. A
systematic study of their use for the computation of moments of various functions (other than
the usual product moments given by $\Gamma_k$) has not yet been done. The mathematical basis for
Plackett's identity goes back to the 19th century work of Schläfli [31] on hyperspherical
simplices, and the later work of the geometer Coxeter [6]. For more on the geometrical aspects of
Plackett's identity and related issues, see Abrahamson [1], Iyengar [16], and Ruben [28].


5  Conclusion 

In this paper, we have discussed recent developments in probability and moment calculations
for elliptically contoured distributions. These developments should allow the use of models other
than the multivariate normal for high dimensional data. Clearly, much more work needs to be
done. For instance, since Monte Carlo is an increasingly popular method for assessing the
performance of various systems, a more systematic study of appropriate sampling functions is
needed. Only the beginnings of such a study are given here.

Legends for the two Figures

FIGURE 1: $\rho < b_1/b_2$; $A$ is bounded by $L_1$ and $L_2$.

$L_1$: $z_2 = \dfrac{b_1(1-\rho^2)^{1/2}}{(b_2 - \rho b_1)}\,(z_1 - (b'R^{-1}b)^{1/2})$, for $z_1 \ge (b'R^{-1}b)^{1/2}$

$L_2$: $z_2 = -\dfrac{b_2(1-\rho^2)^{1/2}}{(b_1 - \rho b_2)}\,(z_1 - (b'R^{-1}b)^{1/2})$, for $z_1 \ge (b'R^{-1}b)^{1/2}$

FIGURE 2: $\rho > b_1/b_2$; $A$ is bounded by $L_1$ and $L_2$.

$L_1$: $z_2 = \dfrac{(\rho z_1 - b_1)}{(1-\rho^2)^{1/2}}$, for $z_1 \ge b_2$

$L_2$: $z_1 = b_2$, for $z_2 \le \dfrac{(\rho b_2 - b_1)}{(1-\rho^2)^{1/2}}$

Abstract

In this paper we discuss the problem of discriminating among various non-linear time
series models. While the method we propose is of a general nature, we consider a
restricted class of models that share an identical AR(1) equivalent correlation function
structure and hence an identical spectral density. Consequently, discriminating
among them on the basis of second order moments is theoretically, and practically,
impossible. The approach being taken is aimed at discriminating among the models
on the basis of higher order moments, i.e., the higher order cumulant structure.
Specifically, we shall focus on the 3rd-order cumulant structures as our initial step beyond
the conventional covariance structure.

Key Words: Time series, Linear, Non-linear, Gaussianity, Stationarity,
Autoregressive, Exponential Models, PAR(1), ARF(1), EAR(1), TEAR(1), NEAR(1),
Robertson's Fixed and Random Models, Correlation and Cumulant Structure.

1  Introduction


Statistical methods based on moment information have been used extensively. In terms of
model identification, the time series literature has devoted considerable attention to
the problem of identifying the orders p and q under the general linear framework of ARMA(p,q)
modelling. Second order correlation information (e.g. acf and pacf) became a main tool in the
process of selecting p and q. While second order information is of paramount importance
in the case where the roots of the AR and MA polynomials remain outside the unit circle,
higher order cumulant information becomes crucial in deciding on the locations of the zeros
or poles of possibly non-invertible, non-causal and non-Gaussian ARMA models. Of course
there are many very useful statistical tools for solving the above mentioned problems which
are not based on moments. Examples include the use of information based criteria such as AIC,
MAIC and BIC in selecting orders of an ARMA model, and the use of MLE in locating roots
of a mixed phase ARMA process. While these non-moment based methods might be
more efficient than moment methods, the moment methods are generally simpler, easier
and intuitively appealing both in theory and computation. It is often the case that one needs
the initial point supplied by such a method to start an efficient but complicated non-moment
based method.

The introduction of non-linear time series models in recent years (e.g. bilinear, threshold,
random coefficient, etc.) amplified the importance of using higher order cumulant
information in discriminating among the various non-linear models. It was shown that different
models are capable of producing an identical correlation function of the linear autoregressive
type, thus giving rise to a class of models characterized as 2nd-order equivalent.
Consequently, efforts have been diverted to the analysis of the higher order cumulant structure with
the hope of exploiting differences among the models at higher order correlation dependence
structure. The basic idea underlying the search for information in the higher order cumulant
structure in order to distinguish two models may be stated as follows. Within the class of
moment determination, moment sequences of two different stochastic processes cannot be
identical. Specifically, given two stationary series $\{X_t\} \neq \{Y_t\}$ there exists $(u_1, \ldots, u_k)$
such that $k$-th order moments or cumulants with lags $(u_1, \ldots, u_k)$ of $\{X_t\}$ and $\{Y_t\}$ are
not equal, i.e.

$$C_X(u_1, u_2, \ldots, u_k) \neq C_Y(u_1, u_2, \ldots, u_k).$$

In practice, one hopes that the above is true for a small order $k$, and that the difference is large
relative to a given sample size. Otherwise, the search for discriminatory power in the higher
order cumulant structure might turn out to be fruitless.

The problem of discrimination among non-linear time series models has been considered by
many authors. Lawrence and Lewis [21] considered special 3rd-order structure of the form

$$\mathrm{Cov}(R_t^{(p)}, X_{t+k}), \qquad \mathrm{Cov}([R_t^{(p)}]^2, X_{t+k}),$$

where $R_t^{(p)}$ are the linear autoregressive residuals of order $p$ for RCA and PAR models.
Within the class of bilinear models Li [26] and Gabr [10] considered quantities of the form

$$\mathrm{Cov}(X_t, X_{t+k}^2)$$

$$\mathrm{Cov}(X_t^2, X_{t+k})$$

respectively. Auestad and Tjostheim [4] considered the use of non-parametric methods aimed
at the conditional mean and variance of various non-linear time series models. Anderson [1]
approached this problem differently by observing differences in the sample paths generated
by the exponential family. Using a fluctuating type statistic he was able to discriminate
among simulated traces for a reasonable number of observations. In his work the moments
do not play a role in the proposed discrimination procedure and as such may provide an
alternative in situations where moments up to the desired order do not exist. Tsay [37]
offers a very general method for selecting a model depending on the type of characteristic
one is interested in investigating.

We propose a new approach which relies on the conjecture that the information required for
discrimination among the models is available in the higher order moments or, equivalently,
in the higher order cumulant structure. Specifically, we shall concentrate our attention on
the third order cumulant structure given by

$$C(r, s) = E[(X_t - \mu_x)(X_{t+r} - \mu_x)(X_{t+s} - \mu_x)]. \qquad (1.1)$$
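As a rough illustration of how the quantity in (1.1) might be estimated from a single realization, the following sketch replaces the expectation by a time average. The function name `third_order_cumulant`, the restriction to non-negative lags and the choice of divisor are assumptions made here for illustration; they are not the estimator used in the paper.

```python
import random


def third_order_cumulant(x, r, s):
    """Sample estimate of C(r, s) = E[(X_t - mu)(X_{t+r} - mu)(X_{t+s} - mu)].

    Assumes a stationary series and non-negative lags r, s (illustrative choice).
    """
    if r < 0 or s < 0:
        raise ValueError("this sketch only handles non-negative lags")
    n = len(x)
    m = n - max(r, s)  # number of triples (t, t+r, t+s) that fit in the sample
    if m <= 0:
        raise ValueError("lags exceed the series length")
    mu = sum(x) / n  # sample mean used in place of the unknown mean
    total = 0.0
    for t in range(m):
        total += (x[t] - mu) * (x[t + r] - mu) * (x[t + s] - mu)
    return total / m
```

For an i.i.d. unit mean exponential sequence the estimate at lags (0, 0) should be close to the third central moment, which is 2, and close to zero at distinct non-zero lags.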

The family of exponential time series models will be the framework within which we shall
show the parametric equality of the correlation (hence, the spectral density) functions, and
the way in which the theoretical higher order cumulant structure points to the differences
among the models. We demonstrate the method for a restricted case where we consider a
family of non-linear time series models with known marginal distributions and a common
AR(1) equivalent correlation structure. This family consists of marginally exponentially
distributed time series models which include:

• (i) Product Autoregressive Model [PAR(1)]

• (ii) Exponential Autoregressive Model [EAR(1)]

• (iii) Transposed Exponential Autoregressive Model [TEAR(1)]

• (iv) Newer Exponential Autoregressive Model [NEAR(1)]

• (v) Robertson's Fixed Model

• (vi) Robertson's Random Model

In addition we shall consider the linear autoregressive model with exponential innovation
process which we shall call ARE(1). As opposed to the family mentioned above the ARE(1)
does not have a known marginal distribution; however, its moments can be computed. This
model, though, shares the same correlation structure as the non-linear exponential family.
The underlying objective is to discriminate among realizations produced by the models
we consider. This task is impossible to accomplish on the basis of the correlation function
alone since they have identical second order structure. We should note that the models being
considered by no means exhaust all such second order equivalent ones.

The plan for the paper is as follows. We shall start with a brief review of the traditional
approach to time series analysis, followed by a presentation of the family of exponential
time series models in section 3. In that section we shall state the form taken up by each
model, show the type of sample traces they are capable of producing, and develop their
correlation functions. Then we give a brief review of higher order cumulants in section 4.
Subsequently, the results we obtained for the 3rd-order cumulant structure for the seven
models under consideration are presented in section 5. General methodology is presented in
section 6. The results of the simulation part are the topic of section 7. There we also briefly
discuss the way in which the sample traces, correlation functions and 3rd-order cumulants
were generated empirically. A brief conclusion is given in section 8.

2  Stationarity,  Linearity  and  Gaussianity 

Over the last several decades statisticians have developed a large body of theory and methods
aimed at the analysis of time series data. A comprehensive account of their work culminated in
books such as Kendall and Stuart [17], Jenkins and Watts [15], Box and Jenkins [5], Hannan
[12], Anderson [2], Brillinger [7], Chatfield [9], Koopmans [18], Priestley [30], Rosenblatt
[32], and Brockwell and Davis [8], to name a few. The foundations of classical time series
analysis, as described in the above references, were thought to be based on two underlying
assumptions, stating that:

1. The time series is stationary to an order of at least two. The process is assumed to
remain in equilibrium about a constant mean level, with the proportion of ordinates not
exceeding any given level being about equal over any time interval spanned by the sample.
In case the observed series does not exhibit such behavior, it is further assumed that
weak stationarity can be achieved by applying an appropriate transformation e.g. linear
filtering.

2. The time series, viewed as a stochastic process $\{X_t, t \in T\}$, is an output from a
linear filter whose input is the white noise process $\{\varepsilon_t\}$; hence, the observed sample
realization can be represented as a linear function of past and present values of $\{\varepsilon_t\}$ -
a one sided representation.
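For concreteness, the one sided representation in assumption 2 is usually written in the following form. The filter weights $a_j$ and the white noise symbol $\varepsilon_t$ are notation introduced here for illustration, and the square-summability condition is the customary one for such filters rather than a statement taken from the text.

```latex
X_t = \sum_{j=0}^{\infty} a_j \, \varepsilon_{t-j},
\qquad \sum_{j=0}^{\infty} a_j^{2} < \infty .
```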

In recent years the validity of these twin assumptions - as reasonable approximations to
sample trace realizations - has been questioned as data from a wider variety of sources
became available. Coupled with advances in the field of non-linear dynamics (deterministic
chaos theories), research in the field of non-stationary, non-linear and non-Gaussian time
series methodology has been in progress. Subsequent efforts to bring the non-linear time series
literature under one unified framework resulted in the publication of books like Priestley
[29] and Tong [36]. The reader is also referred to Mohler [28] for a collection of papers on
theory, computational methods and applications in the area of non-linear signal processing.
Tong [36] discusses properties of the Gaussian stationary linear model (GSLM) which may
possibly be violated:

• (a) Time series that exhibit strong asymmetric behavior cannot be expected to conform
to the GSLM. Such models are characterized by symmetric joint density functions
and that rules out asymmetric sample realizations.

• (b) The GSLM does not give rise to clusters of outliers e.g. sudden bursts of large
magnitudes at irregular time intervals. Observed time series in socio-economic related
phenomena do tend to exhibit groups of outliers.

• (c) Sample traces that demonstrate strong cycles cannot be modeled by the GSLM since
the regression functions at lag $k$ i.e. $E[X_t \mid X_{t-k}]$ are all linear due to the assumed
joint normality.

• (d) The Gaussian process $\{X_t\}$ is reversible i.e. $(X_1, \ldots, X_n)'$ has the same distribution
as $(X_n, \ldots, X_1)'$. Reversibility is violated in the presence of differences in the rate at
which a sample path rises to its maxima, and the rate at which it falls away from them.
One simple way of investigating departures from reversibility is to plot the sample on
a transparency and then turn it over. If the mirror image is similar to the original plot
then the series may be assumed reversible - irreversible otherwise.
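A simple numerical counterpart of the transparency check in (d) can be sketched as follows. If a series is reversible then $(X_t, X_{t+k})$ and $(X_{t+k}, X_t)$ have the same distribution, so $E[X_t^2 X_{t+k}] = E[X_t X_{t+k}^2]$; the sample difference below should therefore be near zero. The function name `reversibility_gap` and the use of this particular pair of moments are choices made here for illustration; this is a necessary-condition check, not a formal test and not a procedure taken from the paper.

```python
import random


def reversibility_gap(x, k=1):
    """Sample value of E[X_t^2 X_{t+k}] - E[X_t X_{t+k}^2].

    Close to zero for a time-reversible series; a clearly non-zero value
    indicates irreversibility (illustrative diagnostic only).
    """
    n = len(x) - k
    if k < 1 or n <= 0:
        raise ValueError("need 1 <= k < len(x)")
    a = sum(x[t] ** 2 * x[t + k] for t in range(n)) / n
    b = sum(x[t] * x[t + k] ** 2 for t in range(n)) / n
    return a - b
```

Applied to Gaussian white noise the gap should be negligible, whereas a linear AR(1) recursion driven by exponential innovations, which rises abruptly and decays slowly, should give a clearly positive value.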

One could also test formally for Gaussianity and linearity. Following Brillinger [6], who
pointed out the potential of using the bispectral density function as the basis for classifying
a process as linear (and possibly Gaussian) or non-linear, Subba Rao and Gabr [35] and
Hinich [13] developed formal tests for linearity and Gaussianity. The tests are based on
the constancy of the normalized bispectral density function under the assumption that $\{X_t\}$
has a linear representation. Tong [36] provides a comprehensive review of tests for linearity
and normality. Priestley [29] considers the case where a stationary process does not fit into a
linear representation and concludes that "a fortiori many types of non-stationary processes
would also fall outside the domain of linear models." In summary, observed time series do
not necessarily conform to models such as the GSLM. The degree to which a time series
realization represents a trace generated by the GSLM has a direct bearing on the usefulness
of estimating an ARMA(p,q) model. For purposes of prediction, forecasting and control one
is better off taking advantage of the non-linear (hence, non-Gaussian) structure of the data
during the modeling stage. If indeed the GSLM is deemed inappropriate, one has the choice
among several families of non-linear models. We shall turn to some of these explicitly in
section 3.

3 AR(1) Type Exponential Models (EAR)


The family of models we consider here is that of the exponential autoregressive models, which
is composed of the EAR(1) and its generalization to the transposed exponential autoregressive
model TEAR(1), and the newer exponential autoregressive NEAR(1) model. This type of time
series models was proposed by Gaver and Lewis [11], Lawrance and Lewis [20, 21], Jacobs
and Lewis [14], Lawrance [19] and further developed by Lawrance and Lewis [22, 23, 24]. We
also consider Robertson's Fixed and Random models [31], and the Product Autoregressive
PAR(1) model proposed by McKenzie [27], all models being restricted to a first order
autoregressive structure.

In contrast with other non-linear time series models (e.g. bilinear and threshold), this class
of models is an attempt to capture the behavior of, possibly observed, time series processes
with explicit marginal exponential distributions. The family of EAR models is advocated
as a way of relaxing the assumption of marginal Gaussianity which underlies the Gaussian
linear stationary model. The reasons behind the choice of the exponential distribution as
the marginal distribution are given in Gaver and Lewis [11] and Lawrance and Lewis [23]. The
standard linear first order autoregressive process, AR(1), with exponential input, ARE(1),
will be used for comparison purposes in section 5. This model has correlation and spectral
density functions identical to those of the models mentioned above; however, its marginal
distribution is not known, thus it is not to be considered as an exponential model but
rather as a linear AR(1) model with exponential input. The fact that it is linear enables us
to distinguish it from any other non-linear model, with or without an identical correlation
structure, based on the theoretical result stating that a process with a linear representation
has a flat (constant) normalized bispectral density; for more details see Subba Rao and Gabr
[35].


3.1 PAR(1) Model


A natural extension of the linear AR(1) model was proposed by McKenzie [27] and consists
of an exponentiation of the linear model such that the additive form is transformed
into a multiplicative form. Here we consider a special case of the gamma family of marginally
distributed time series where the output series has an exponential marginal distribution of
unit mean. Specifically,

$$X_t = X_{t-1}^{\alpha} V_t \qquad (3.1)$$

where $\alpha \in (0, 1)$ and $V_t$ is given by a mixture of uniform $(0, \pi)$ and exponential mean one
random variables independent of each other.

This model differs from the others we consider in two aspects. First, the innovation process
does not possess a known parametric density function and its higher order cumulant structure
is expressed in terms of the moments of $X_t$ only. Second, we note that (3.1) may be linearized
by taking the logs of both sides of the equation. As such it is classified as an intrinsically
linear model i.e. a non-linear model which can be linearized. It differs from the following
models which cannot be linearized due to their switching nature and are to be considered
under the class of intrinsically non-linear models i.e. non-linear models which cannot be
linearized.
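The linearization mentioned above can be written out explicitly. Assuming the product form $X_t = X_{t-1}^{\alpha} V_t$ for the PAR(1) recursion, taking logarithms gives a linear first order autoregression in the log series. The symbols $Y_t$ and $W_t$ are shorthand introduced here and do not appear in the original.

```latex
\log X_t = \alpha \log X_{t-1} + \log V_t
\quad\Longleftrightarrow\quad
Y_t = \alpha Y_{t-1} + W_t ,
\qquad Y_t = \log X_t , \quad W_t = \log V_t .
```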

3.2 EAR(1) Model

In the following setup we let $\{E_t\}$ be a sequence of i.i.d exponential ($\lambda$) random variables
with a probability density function given by

$$f_E(e) = \begin{cases} \lambda e^{-\lambda e} & e \ge 0,\ \lambda > 0 \\ 0 & \text{otherwise.} \end{cases} \qquad (3.2)$$




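We define an EAR(1) model as

$$\begin{aligned}
X_t &= \rho X_{t-1} + \varepsilon_t && (3.3) \\
    &= \begin{cases} \rho X_{t-1} & \text{with prob. } \rho \\ \rho X_{t-1} + E_t & \text{with prob. } (1 - \rho) \end{cases} && (3.4) \\
    &= \rho X_{t-1} + I_t E_t && (3.5)
\end{aligned}$$

with $(0 \le \rho < 1)$ and $\{I_t\}$ being an i.i.d sequence defined by

$$I_t = \begin{cases} 0 & \text{with prob. } \rho \\ 1 & \text{with prob. } 1 - \rho. \end{cases} \qquad (3.6)$$

Under this formulation $\{X_t\}$ is marginally distributed as an exponential random variable
with parameter $\lambda$.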

Gaver and Lewis [11] point out several characteristics of the EAR(1) model:


• Setting $\rho = 0$ yields the special case where $\{X_t\}$ is a sequence of i.i.d exponential
random variables.

• $\varepsilon_t$ is not a continuous random variable. This feature distinguishes (3.5) from the usual
linear AR(1) equation with Gaussian or exponential input.

• The representation (3.4) is one of a random linear combination of an i.i.d exponential
sequence; thus, it can be easily simulated on a computer.


One problem the EAR(1) model has is called the zero defect (see Lawrance and Lewis [22]) and
relates to the sample paths it generates. Specifically, the model generates paths in which
large values are followed by runs of decreasing values, with the runs having geometrically
distributed lengths. The large values arise when $E_t$ is included (i.e. $I_t = 1$) while the falling
values stem from the deterministic part of (3.4) i.e. $\rho X_{t-1}$.
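The remark that the EAR(1) recursion is easy to simulate can be illustrated with a short sketch. It assumes the switching form in which $X_t = \rho X_{t-1}$ with probability $\rho$ and $X_t = \rho X_{t-1} + E_t$ otherwise, with $E_t$ exponential($\lambda$). The function name `simulate_ear1`, its arguments and the choice to start the path from an exponential draw are illustrative assumptions, not the simulation code used in the paper.

```python
import random


def simulate_ear1(n, rho, lam=1.0, seed=None):
    """Simulate n values of X_t = rho * X_{t-1} + I_t * E_t.

    I_t = 0 with prob. rho and 1 with prob. 1 - rho; E_t ~ exponential(lam).
    The initial value is drawn from the exponential(lam) marginal.
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must lie in [0, 1)")
    rng = random.Random(seed)
    x = rng.expovariate(lam)
    path = []
    for _ in range(n):
        # the innovation is switched off with probability rho
        innovation = rng.expovariate(lam) if rng.random() >= rho else 0.0
        x = rho * x + innovation
        path.append(x)
    return path
```

A long simulated path should have sample mean close to $1/\lambda$ and lag one autocorrelation close to $\rho$; plotting it for a large $\rho$ shows the runs of geometrically decaying values described above.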


3.3 TEAR(1) Model


A natural extension of the EAR(1) model is to interchange the role of $X_{t-1}$ and $E_t$ in (3.5).
This does not affect the exponential ($\lambda$) marginal distribution of $X_t$. Upon replacing $\rho$ by
$1 - \alpha$ we obtain the transposed exponential autoregressive TEAR(1) model

$$\begin{aligned}
X_t &= I_t X_{t-1} + (1 - \alpha) E_t && (3.7) \\
    &= \begin{cases} X_{t-1} + (1 - \alpha) E_t & \text{with prob. } \alpha \\ (1 - \alpha) E_t & \text{with prob. } 1 - \alpha \end{cases} && (3.8)
\end{aligned}$$

where

$$I_t = \begin{cases} 0 & \text{with prob. } 1 - \alpha \\ 1 & \text{with prob. } \alpha. \end{cases} \qquad (3.9)$$

Note that in this case the innovation process is a continuous random variable scaled by
a constant $1 - \alpha$. The behavior of a simulated path, for a large $\alpha$, shows geometrically
distributed runs of rising values (i.e. $I_t = 1$) followed by sharp declines when the selection
$I_t = 0$ is made. The decline is due to the exclusion of the previous value $X_{t-1}$.
The TEAR(1) model is discussed by Lawrance and Lewis [22] as an extension of the EAR(1)
model. However, TEAR(1) is also a special case of Arnold's [3] exponential model driven by
past innovations. Specifically, define the random variables


$$N_t = 1 \ \text{ if and only if } \ U_t = 1$$

$$N_t = i \ \text{ if and only if } \ U_t = 0,\ U_{t-1} = 0,\ \ldots,\ U_{t-i+2} = 0,\ U_{t-i+1} = 1$$

where $U_t$ are i.i.d Bernoulli($p$) random variables with $N_t$ being distributed identically but
not independently as Geometric($p$) random variables with domain $1, 2, 3, \ldots$.

The model, expressed in terms of past innovations, is given by

$$X_t = p \sum_{i=1}^{N_t} E_{t-i+1} \qquad (3.10)$$

where $E_t$ are i.i.d Exp($\lambda$) and the sum is multiplied by $p$ to obtain strict stationarity. This
representation is obtained if one expresses the TEAR(1) model recursively (with $p = 1 - \alpha$).
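A minimal simulation sketch of the TEAR(1) recursion follows, assuming the form $X_t = I_t X_{t-1} + (1 - \alpha) E_t$ with $I_t = 1$ with probability $\alpha$ and $E_t$ exponential($\lambda$). The function name `simulate_tear1` and its arguments are hypothetical and serve only to illustrate the switching mechanism.

```python
import random


def simulate_tear1(n, alpha, lam=1.0, seed=None):
    """Simulate n values of X_t = I_t * X_{t-1} + (1 - alpha) * E_t.

    I_t = 1 with prob. alpha and 0 with prob. 1 - alpha; E_t ~ exponential(lam).
    The initial value is drawn from the exponential(lam) marginal.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    rng = random.Random(seed)
    x = rng.expovariate(lam)
    path = []
    for _ in range(n):
        # the previous value is kept with probability alpha, dropped otherwise
        carry = x if rng.random() < alpha else 0.0
        x = carry + (1.0 - alpha) * rng.expovariate(lam)
        path.append(x)
    return path
```

For a large $\alpha$ the simulated path climbs in small steps and then drops sharply whenever the previous value is excluded, which is the mirror image of the EAR(1) behavior.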




3.4 NEAR(1) Model


The previous two models, EAR(1) and TEAR(1), are special cases of a more flexible model
in which $X_{t-1}$ in (3.8) is scaled by a coefficient $\beta$; thus, simulated realizations generated
by such a model are of interest as it may circumvent the problem of geometrically distributed
runs of falling or increasing values which might not be applicable. Specifically, let $\{X_t\}$
denote the time series variables and $\{E_t\}$ be a sequence of i.i.d unit mean exponential
random variables acting as the innovation process. The NEAR(1) model is defined as

$$\begin{aligned}
X_t &= \varepsilon_t + \begin{cases} \beta X_{t-1} & \text{with prob. } \alpha \\ 0 & \text{with prob. } 1 - \alpha \end{cases} && (3.11) \\
    &= \beta I_t X_{t-1} + \varepsilon_t && (3.12)
\end{aligned}$$

where

$$\varepsilon_t = \begin{cases} E_t & \text{with prob. } p \\ b E_t & \text{with prob. } 1 - p \end{cases} \qquad (3.13)$$

$$I_t = \begin{cases} 0 & \text{with prob. } 1 - \alpha \\ 1 & \text{with prob. } \alpha \end{cases} \qquad (3.14)$$

with $b = (1 - \alpha)\beta$ and $p = \frac{1 - \beta}{1 - b}$. The parameters $\alpha$ and $\beta$ are allowed to take values
over the domain defined by $0 < \alpha, \beta \le 1$ with $\alpha\beta \neq 1$. Setting $(\alpha = 1,\ 0 < \beta < 1)$ in
(3.12) yields the EAR(1) model, whereas fixing $(\beta = 1,\ 0 < \alpha < 1)$ gives rise to the TEAR(1)
model. Both are extreme cases of a NEAR(1) process. We note that due to the distributional
assumption underlying $\{E_t\}$, the innovation process is not allowed to take on negative
values i.e. $P[\varepsilon_t < 0] = 0$. It is obvious how the concept of switching comes into play in
(3.12). The switch from one linear piece to the other is controlled by an external random
mechanism with a pre-specified parametric probabilistic structure.
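The two-level switching in the NEAR(1) model can be sketched as below. The sketch assumes the recursion $X_t = \beta I_t X_{t-1} + \varepsilon_t$, where $I_t = 1$ with probability $\alpha$, and $\varepsilon_t = E_t$ with probability $p$ or $bE_t$ otherwise, with $b = (1 - \alpha)\beta$, $p = (1 - \beta)/(1 - b)$ and $E_t$ unit mean exponential. The function name `simulate_near1` and its arguments are illustrative assumptions.

```python
import random


def simulate_near1(n, alpha, beta, seed=None):
    """Simulate n values of X_t = beta * I_t * X_{t-1} + eps_t.

    I_t = 1 with prob. alpha; eps_t = E_t with prob. p, else b * E_t, where
    b = (1 - alpha) * beta, p = (1 - beta) / (1 - b), and E_t ~ exponential(1).
    """
    if not (0.0 < alpha <= 1.0 and 0.0 < beta <= 1.0) or alpha * beta == 1.0:
        raise ValueError("need 0 < alpha, beta <= 1 with alpha * beta != 1")
    b = (1.0 - alpha) * beta
    p = (1.0 - beta) / (1.0 - b)
    rng = random.Random(seed)
    x = rng.expovariate(1.0)  # start from the unit mean exponential marginal
    path = []
    for _ in range(n):
        e = rng.expovariate(1.0)
        eps = e if rng.random() < p else b * e          # innovation switch
        carry = beta * x if rng.random() < alpha else 0.0  # autoregressive switch
        x = carry + eps
        path.append(x)
    return path
```

Setting `alpha=1.0` reduces the sketch to the EAR(1) case and `beta=1.0` to the TEAR(1) case, in line with the special cases noted above.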

3.5 Robertson's Fixed and Random models


Robertson [31] suggested two exponential models which we shall refer to as Robertson's fixed
and random models. Our main concern is to show that these models cannot be identified
via the correlation or spectral density functions; hence, one has to explore the higher order
cumulant structure.


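3.5.1 The Fixed Model

Consider the following switching structure

$$X_t = \begin{cases} X_{t-1} - \ln\beta & \text{with prob. } \beta \\ E_t & \text{with prob. } 1 - \beta \end{cases} \qquad (3.15)$$

where $\beta$ is a fixed constant. $E_t$ has a truncated exponential distribution given by

$$f_E(e) = \begin{cases} \dfrac{e^{-e}}{1 - \beta} & 0 < e < -\ln\beta \\ 0 & \text{otherwise} \end{cases} \qquad (3.16)$$

with the marginal distribution of $X_t$ being exponential with unit mean. Alternatively, (3.15)
may be represented using an indicator random variable i.e.

$$X_t = I_t (X_{t-1} - \ln\beta) + (1 - I_t) E_t \qquad (3.17)$$

where

$$I_t = \begin{cases} 1 & \text{with prob. } \beta \\ 0 & \text{with prob. } 1 - \beta. \end{cases} \qquad (3.18)$$
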

3.5.2 The Random model


One may generalize the fixed model by allowing $\beta$ to become a random variable which acts
as a mixing distribution, with domain restricted to the interval $[0, 1]$. Specifically, let $X_t$ have
the representation

$$X_t = \begin{cases} X_{t-1} - \ln\beta_t & \text{with prob. } \beta_t \\ E_t & \text{with prob. } 1 - \beta_t \end{cases} \qquad (3.19)$$

or stated in terms of an indicator random variable

$$X_t = I_t (X_{t-1} - \ln\beta_t) + (1 - I_t) E_t \qquad (3.20)$$

where

$$I_t = \begin{cases} 1 & \text{with prob. } \beta_t \\ 0 & \text{with prob. } 1 - \beta_t. \end{cases} \qquad (3.21)$$

The probability density assigned to $\beta_t$ is a beta density with parameters $(\alpha, 2)$

$$f_\beta(\beta) = \begin{cases} \alpha(\alpha + 1)(1 - \beta)\beta^{\alpha - 1} & 0 < \beta < 1,\ \alpha > 0 \\ 0 & \text{otherwise.} \end{cases} \qquad (3.22)$$

The distribution of $\ln\beta_t$ is obtained using the standard transformation of variables technique.
Let $Y = \ln\beta_t$, then

$$f_Y(y) = \begin{cases} \alpha(\alpha + 1)(1 - e^{y}) e^{\alpha y} & -\infty < y < 0,\ \alpha > 0 \\ 0 & \text{otherwise.} \end{cases} \qquad (3.23)$$

The probability density function for $E_t$ is appropriately modified

$$f_E(e \mid \beta_t) = \begin{cases} \dfrac{e^{-e}}{1 - \beta_t} & 0 < e < -\ln\beta_t \\ 0 & \text{otherwise.} \end{cases} \qquad (3.24)$$

Within this framework one notices that the random variables $I_t$ and $E_t$ are not independent
as they both involve the mixing distribution of $\beta_t$. The marginal distribution of $X_t$, though,
remains exponential with unit mean by construction. We remark that all these models are
strictly stationary and hence, having finite second moments, also stationary in the wide sense.
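The random model differs from the fixed one only in that a fresh mixing value is drawn at every step. The following sketch assumes $\beta_t$ is drawn from a beta distribution with parameters $(\alpha, 2)$, independently of the past, and then applies the fixed-model switching rule with that $\beta_t$. The function name `simulate_robertson_random` and the use of `random.betavariate` are illustrative choices made here.

```python
import math
import random


def simulate_robertson_random(n, alpha, seed=None):
    """Simulate n values of Robertson's random model with beta_t ~ Beta(alpha, 2).

    At each step: with prob. beta_t the path is shifted up by -ln(beta_t);
    otherwise it restarts from an exponential(1) truncated to (0, -ln(beta_t)).
    """
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    rng = random.Random(seed)
    x = rng.expovariate(1.0)  # start from the unit mean exponential marginal
    path = []
    for _ in range(n):
        beta_t = rng.betavariate(alpha, 2.0)
        if rng.random() < beta_t:
            x = x - math.log(beta_t)
        else:
            # inverse CDF of the exponential truncated to (0, -ln(beta_t))
            u = rng.random()
            x = -math.log(1.0 - u * (1.0 - beta_t))
        path.append(x)
    return path
```

Because the update preserves the unit mean exponential law for every value of $\beta_t$, the sample mean and variance of a long path should both be close to one.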


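3.6 Summary

We recall that the models under investigation are the following:

• ARE(1):

$$X_t = \phi X_{t-1} + E_t \qquad (3.25)$$

• PAR(1):

$$X_t = X_{t-1}^{\alpha} V_t \qquad (3.26)$$

• EAR(1):

$$X_t = \begin{cases} \rho X_{t-1} & \text{with prob. } \rho \\ \rho X_{t-1} + E_t & \text{with prob. } 1 - \rho \end{cases} \qquad (3.27)$$

• TEAR(1): $(\rho = 1 - \alpha)$

$$X_t = \begin{cases} X_{t-1} + (1 - \alpha) E_t & \text{with prob. } \alpha \\ (1 - \alpha) E_t & \text{with prob. } 1 - \alpha \end{cases} \qquad (3.28)$$

• NEAR(1):

$$X_t = \varepsilon_t + \begin{cases} \beta X_{t-1} & \text{with prob. } \alpha \\ 0 & \text{with prob. } 1 - \alpha \end{cases} \qquad (3.29)$$

where

$$\varepsilon_t = \begin{cases} E_t & \text{with prob. } p \\ b E_t & \text{with prob. } 1 - p \end{cases} \qquad p = \frac{1 - \beta}{1 - b}, \quad b = (1 - \alpha)\beta$$

• Robertson's Fixed Model:

$$X_t = \begin{cases} X_{t-1} - \ln\beta & \text{with prob. } \beta \\ E_t & \text{with prob. } 1 - \beta \end{cases} \qquad (3.30)$$

where

$$f_E(e) = \begin{cases} \dfrac{e^{-e}}{1 - \beta} & 0 < e < -\ln\beta \\ 0 & \text{otherwise} \end{cases}$$

• Robertson's Random Model:

$$X_t = \begin{cases} X_{t-1} - \ln\beta_t & \text{with prob. } \beta_t \\ E_t & \text{with prob. } 1 - \beta_t \end{cases} \qquad (3.31)$$

where

$$f_\beta(\beta) = \begin{cases} \alpha(\alpha + 1)(1 - \beta)\beta^{\alpha - 1} & 0 < \beta < 1,\ \alpha > 0 \\ 0 & \text{otherwise.} \end{cases}$$

Table 1: Correlation Functions


ARE(1)

PAR(1)

EAR(1)

TEAR(1)

NEAR(1)

Robertson's Fixed

Robertson's Random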

o* 

o* 

M)" 

r 

■ .  _ 

For all models, but Robertson's and PAR(1), the input process $\{E_t\}$ is assumed to be an
i.i.d exponential sequence of unit mean and, with the exception of ARE(1), the output $\{X_t\}$
has a marginal exponential distribution with mean one. The correlation functions for the
various models are given in table 1.

Figures 1 - 3 contain simulated traces produced by the various models. Note that we indexed
the parameter values of each model such that the correlation functions produce identical
results i.e. $\rho(s) = (0.1)^s, (0.5)^s, (0.75)^s$.

4  Higher  Order  Cumulants 


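Let $\{X_t\}$ be a real valued strictly stationary random process and let $m(t_1, t_2, \ldots, t_k)$ be the
$k$th-order product moment i.e.

$$m(t_1, t_2, \ldots, t_k) = E[X_{t_1} X_{t_2} \cdots X_{t_k}]. \qquad (4.1)$$

For a stationary process of order $k$, we can write (4.1) as

$$m(t_1, t_2, \ldots, t_k) = m(0, t_2 - t_1, \ldots, t_k - t_1). \qquad (4.2)$$

Now let the characteristic function (cf) of $\{X_t\}$ be defined by

$$\phi_X(\zeta_1, \zeta_2, \ldots, \zeta_k) = E[e^{i(\zeta_1 X_{t_1} + \zeta_2 X_{t_2} + \cdots + \zeta_k X_{t_k})}] \qquad (4.3)$$

then the Taylor series expansion of (4.3) about the origin is given by

$$\begin{aligned}
\phi_X(\zeta) &= \int \Big\{ 1 + \sum_{j=1}^{k} \frac{i^j}{j!} (\zeta_1 x_{t_1} + \cdots + \zeta_k x_{t_k})^j + O(|\zeta|^k) \Big\} \, dF && (4.4) \\
&= 1 + \sum_{j=1}^{k} \frac{i^j}{j!} E[(\zeta_1 X_{t_1} + \cdots + \zeta_k X_{t_k})^j] + O(|\zeta|^k) && (4.5) \\
&= 1 + \sum_{j=1}^{k} \frac{i^j}{j!} \sum_{\nu_1, \ldots, \nu_j = 1}^{k} \zeta_{\nu_1} \cdots \zeta_{\nu_j} \, m(t_{\nu_1}, \ldots, t_{\nu_j}) + O(|\zeta|^k) && (4.6)
\end{aligned}$$

where $|\zeta| = (\sum_{i=1}^{k} \zeta_i^2)^{1/2}$ and $F = F_{X_{t_1}, X_{t_2}, \ldots, X_{t_k}}(x_{t_1}, x_{t_2}, \ldots, x_{t_k})$ being the joint cumulative
distribution function.

The logarithm of the cf (4.3) is defined as the cumulant generating function (cgf)

$$K_X(\zeta_1, \zeta_2, \ldots, \zeta_k) = \log\{ E[e^{i(\zeta_1 X_{t_1} + \zeta_2 X_{t_2} + \cdots + \zeta_k X_{t_k})}] \} \qquad (4.7)$$

such that $C(t_1, t_2, \ldots, t_k)$, the $k$th-order joint cumulant of the set of random variables
$\{X_{t_1}, X_{t_2}, \ldots, X_{t_k}\}$, is the coefficient of $i^k \zeta_1 \zeta_2 \cdots \zeta_k$ in the Taylor series expansion of (4.7)
about the origin. Specifically

$$K_X(\zeta) = \sum_{j=1}^{k} \frac{i^j}{j!} \sum_{\nu_1, \ldots, \nu_j = 1}^{k} \zeta_{\nu_1} \cdots \zeta_{\nu_j} \, C(t_{\nu_1}, \ldots, t_{\nu_j}) + O(|\zeta|^k) \qquad (4.8)$$

where $C(t_1, t_2, \ldots, t_j) = \mathrm{Cumulant}(X_{t_1}, X_{t_2}, \ldots, X_{t_j})$. We note that the cumulants of order
greater than two are all zero for a Gaussian process. This feature is used extensively in signal
processing to suppress Gaussian noise.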

The relationships between moments and cumulants were formalized by Leonov and Shiryaev
[25] and are given by

$$m(t_1, \ldots, t_k) = E[X_{t_1} X_{t_2} \cdots X_{t_k}] = \sum_{\nu} C(\nu_1) C(\nu_2) \cdots C(\nu_p) \qquad (4.9)$$

where the sum is taken over all partitions $\nu = (\nu_1, \ldots, \nu_p)$ of $(t_1, \ldots, t_k)$ and $C(\nu_j)$ denotes
the joint cumulant of the variables indexed by $\nu_j$.


Relationship (4.9) implies that we can write the moments in terms of the cumulants and if
we invert (4.9) then one can write the cumulants in terms of the corresponding moments;
hence, the inversion of (4.9) yields

$$C(X_{t_1}, \ldots, X_{t_k}) = \sum_{\nu} (-1)^{p-1} (p - 1)! \, E\Big( \prod_{j \in \nu_1} X_j \Big) \cdots E\Big( \prod_{j \in \nu_p} X_j \Big) \qquad (4.10)$$

and if the process is $k$th-order stationary then we may write

$$C(t_1, t_2, \ldots, t_k) = C(0, t_2 - t_1, \ldots, t_k - t_1) = C(s_1, s_2, \ldots, s_{k-1}).$$

From (4.10) it is seen that the cumulant $C(s_1, s_2, \ldots, s_{k-1})$ is a $k$th-order polynomial in the
moments of order no higher than $k$ and conversely, the $k$th-order moment $m(t_1, \ldots, t_k)$ is a
$k$th-order polynomial in cumulants of order no higher than $k$. Consider the specific cases:

$$\begin{aligned}
C_1(0) &= E[X_t] = \mu_x \\
C_2(s) &= \mu(s) - \mu_x^2 \\
C_3(s_1, s_2) &= \mu(s_1, s_2) - \{\mu(s_1) + \mu(s_2) + \mu(s_2 - s_1)\}\mu_x + 2\mu_x^3 \\
C_4(s_1, s_2, s_3) &= \mu(s_1, s_2, s_3) \\
&\quad - \mu_x \{\mu(s_2 - s_1, s_3 - s_1) + \mu(s_2, s_3) + \mu(s_1, s_3) + \mu(s_1, s_2)\} \\
&\quad + 2\mu_x^2 \{\mu(s_1) + \mu(s_2) + \mu(s_3) + \mu(s_2 - s_1) + \mu(s_3 - s_1) + \mu(s_3 - s_2)\} \\
&\quad - \mu(s_1)\mu(s_3 - s_2) - \mu(s_2)\mu(s_3 - s_1) - \mu(s_3)\mu(s_2 - s_1) - 6\mu_x^4
\end{aligned}$$

where

$$\begin{aligned}
\mu(s) &= E[X_t X_{t+s}] \\
\mu(s_1, s_2) &= E[X_t X_{t+s_1} X_{t+s_2}] \\
\mu(s_1, s_2, s_3) &= E[X_t X_{t+s_1} X_{t+s_2} X_{t+s_3}].
\end{aligned}$$

Consequently, one may write $C_3(s_1, s_2)$ in the form

$$\begin{aligned}
C(r, s) &= E[(X_t - \mu_x)(X_{t+r} - \mu_x)(X_{t+s} - \mu_x)] && (4.11) \\
&= E(X_t X_{t+r} X_{t+s}) - \mu_x [E(X_t X_{t+r}) + E(X_t X_{t+s}) + E(X_{t+r} X_{t+s})] + 2\mu_x^3
\end{aligned}$$

and $C_4(s_1, s_2, s_3)$ may be expressed as

$$\begin{aligned}
C(r, s, u) &= E[X_t X_{t+r} X_{t+s} X_{t+u}] && (4.12) \\
&\quad - \mu_x (E[X_t X_{t+s-r} X_{t+u-r}] + E[X_t X_{t+s} X_{t+u}] + E[X_t X_{t+r} X_{t+s}] + E[X_t X_{t+r} X_{t+u}]) \\
&\quad + 2\mu_x^2 (E[X_t X_{t+r}] + E[X_t X_{t+s}] + E[X_t X_{t+u}] \\
&\qquad\quad + E[X_t X_{t+s-r}] + E[X_t X_{t+u-r}] + E[X_t X_{t+u-s}]) \\
&\quad - E[X_t X_{t+r}] E[X_t X_{t+u-s}] - E[X_t X_{t+s}] E[X_t X_{t+u-r}] - E[X_t X_{t+u}] E[X_t X_{t+s-r}] - 6\mu_x^4 .
\end{aligned}$$
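The moment form of the third order cumulant lends itself to a direct numerical check. The sketch below assumes the expansion $C(r, s) = E(X_t X_{t+r} X_{t+s}) - \mu_x [E(X_t X_{t+r}) + E(X_t X_{t+s}) + E(X_{t+r} X_{t+s})] + 2\mu_x^3$ for a stationary series, and uses stationarity to estimate $E(X_{t+r} X_{t+s})$ by the lag $|s - r|$ product moment. The function names `raw_moment` and `c3_from_moments` are hypothetical helpers introduced for illustration.

```python
import random


def raw_moment(x, lags):
    """Sample estimate of E[X_t * X_{t+lag_1} * ... ] for non-negative lags."""
    m = len(x) - max(lags, default=0)
    if m <= 0:
        raise ValueError("lags exceed the series length")
    total = 0.0
    for t in range(m):
        prod = x[t]
        for lag in lags:
            prod *= x[t + lag]
        total += prod
    return total / m


def c3_from_moments(x, r, s):
    """Third order cumulant C(r, s) assembled from raw product moments."""
    mu = sum(x) / len(x)
    third = raw_moment(x, (r, s))
    # E[X_{t+r} X_{t+s}] depends only on |s - r| under stationarity
    pairs = raw_moment(x, (r,)) + raw_moment(x, (s,)) + raw_moment(x, (abs(s - r),))
    return third - mu * pairs + 2.0 * mu ** 3
```

For an i.i.d. unit mean exponential sequence the raw moments are 1, 2 and 6, so the value at lags (0, 0) should be close to $6 - 3 \cdot 2 + 2 = 2$, and the value at distinct non-zero lags should be close to zero.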

For a detailed account of the relations between moments and cumulants the reader is advised
to consult Kendall, Stuart and Ord [16]. Cumulants and their relationship to spectral analysis
are discussed by Sesay [34] and Rosenblatt [33]. Sesay [34] discusses the various uses of
cumulants and cumulant spectra, specifically

• Cumulant spectra are used in tests aimed at discriminating between linear and non-linear
non-Gaussian processes (see Subba Rao and Gabr [35]).

• The asymptotic distributions in some non-linear theory may be obtained using
cumulants.

• Time reversibility may be determined by verifying $C(-s_1, \ldots, -s_{k-1}) = C(s_1, \ldots, s_{k-1})$
or equivalently that the imaginary part of the $k$th-order spectrum is equal to zero.

• Cross-cumulants, and cross-cumulant spectra, can be used in the estimation of the
parameters of a non-linear difference equation through the use of transfer functions
that arise in the Volterra expansion (see Priestley [30]).

5 The 3rd-Order Cumulant Structure


In the following we shall present the 3rd-order cumulant structure for each of the models
discussed in section 3. For each of the models a closed form solution to the 3rd-order cumulant
structure is given. These solutions are based on closed form expressions obtained for the
expectation terms which define the 3rd-order cumulant structure. For all but the ARE(1) model
the output process is marginally distributed as an exponential process with unit mean. The
results presented in this section are based on the marginal moments given by

$$\mu_x = E[X_t] = 1, \qquad E[X_t^2] = 2, \qquad E[X_t^3] = 6.$$

The input process is taken as an i.i.d exponential process with unit mean; hence, with
identical moments as stated above. Robertson's and PAR(1) models form a separate class, in this
respect, since the innovation process is defined by a sequence of i.i.d truncated exponential
random variables and a mixture of exponential and uniform random variables, respectively.
The introduction of a mixing distribution in Robertson's random model further complicates
the structure of the innovation process. Tables 2, 3 and 4 list the 3rd-order cumulant structure
for these models. We recall that the models under investigation are given by (3.25)-(3.31).

The following expressions are used in tables 2, 3, 4, 5 and 6:

I 

1  ~-o 

') 

(  I  —  o'  ill  -  O ) 

fi 

( 1  -  o’) I  I  -■  o~ )( I  o) 

1  -  o' 

I  -  o 


l>. 
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/■-  [A  ’  j  l(  1  f  o  )  ,  r.  r  (tt.  1  ) 

/'  - 

/'  ~r  (  1  p)l> 

-in  A  (  1  -■  />')/>-] 

1  ,/r 

'.d  i  )  - 

1  i 

-  1  T  t 

1  -  (od) 

— 

!  . nd 

-  Vi”) 

1  -  A 

I  A 

A  = 

( i  d" 

a  -~- 

r\h>.i,i,\  ~  -  ----  4  /•;[/'•/,! 

l+o 

/>, 

/'.  [h/.i.  /,']  —  h'[ln.i,lt] 

-  /•:[/•- 

(1  +  o  )z 

,/  = 

/‘.  t  A  r  A  ^  _  |  j  —  />//,.;  -  (lftr 

1  l  •  — 

1  -  n r 

l  -  a 

i) 

o 

o  -i  1 

riln.il, j  ~ 

1.  Jn.il,  j  —o,o+  1  )  <  — 

[  (o  +  d  )■ 

i:\[  In.if  l,\  - 

1:  [( ln.i)~  1:  j  —  o(o  *-  1  )  <  — 

(  (o  + 

i,\  .. 

l  \r.,l;\  -  of  n  ‘  1  )  { - -  - 

Id  f  1 

/  /:/•! 

/  I /,'!  :  2o(o  +  1  i  |  ■  —  - 

(o  f  I  )i 


... 1 . 1 

(o  f 


fo  +  -)J 


♦  > 


(o 


-i  :> i-' 


Given the information summarized in these tables one may standardize the rate of decay of the correlation function such that the correlation functions are identical for these models for a specific parameter value. Our goal is to investigate how the 3rd-order cumulant structure behaves subject to a standardized correlation function. It is our conjecture that one might be able to discriminate among signal paths produced by the various models on the basis of higher order moments. It is obvious that the correlation functions cannot be


Table 4: 3rd-Order Cumulant Structure: The Intrinsically Non-linear Models


EAR(l) 

TEAR(l) 

NEAR(l) 

C(U,0) 

2 

2 

2 

C(O.r) 

2pT 

2or 

2(o  A)r 

(’(t.t) 

2<r[\  +  r(l  -  o)] 

f)Ar 

+  ■\{(iJy[firt;i(T)  -  1] 

+2/o[/m>M1  ,({l.\(r) 
-(oA)r-S.i(r)}  -%a(r)] 

T/d.-j';  \(  T ) 

C’(M  +t) 

-vr+a 

2or+l(2  -  o) 

2M)r+l[:U  +  2/o] 

+(oA)r/d.j 

+  -mu(7-)/d[//,  +  2nd] 

-[2{(oM)r(  1  +  o A)  -f-  oA} 

+/d  {A  >.i( ") 

+  r  +  1)  +  1}]  +  2 

C  [hJl  +  T) 

2 pT+2h 

2oT+/,[l  +  /;( 1  -  o)] 

(i(oA)rA,‘ 

+  1(0  J)T+h,ir,Afi) 

+2(oA)''/d 

+2(oA)  +l//;  |_'y.}{-/.\(^) 

-in  ‘i)h-'-,(h)} 

+  (dAr,o.+n(/t) 

+/C',.s(  h  )'/„.f(r) 

—  [- { ( o A)r(  1  +  (o .i)1, 1  +  (oA);' } 

+  /d  {'Sad7)  +  +  ..A  T  +  h) 

+  +  2 

more  informative  for  the  purpose  of  discriminating  among  the  models  :  ■■  .-|2~ and  r-y:j44y  ■ 
In the simulation context, however, since the cumulant surfaces decay rapidly towards 0, the computation of these ratios becomes difficult as we attempt to divide by very small values. These numerical considerations destabilize the use of the ratios as a tool for discriminating among the models. The computed ratios (as functions of the lag $\tau$), indexed by a set of parameter values such that the correlation function of each model exhibits an identical behavior (e.g. $\rho(s) = (0.5)^s$), are also given, figures 5 and 6, so as to demonstrate the shapes of the expressions given in the first and fourth rows of table 6.

Given the plots of the ratios and the cumulant surfaces for the six models we may classify them into three categories. EAR(1) forms its own class. Robertson's models and TEAR(1)


Table 5: 3rd-Order Cumulant Structure: The Intrinsically Non-linear Models (cont.)


Robertson’s  Fixed  Model 

Robertson’s  Random  Model 

('(0.0) 

*> 

2 

C(0.r) 

2.C 

\Jer.  ...  _ 

i'(T.T) 

2dr(  1  -  r/iil) 

2D  [l  *j 

+  (o  +  2)«/»rh(r  -  1)  -  (r  I),/-1] 

+  ■;•(  r)[2u  f  c] 

C(l.l  +  r) 

2dr  +  ,(  1  -  In.i) 

be  +1 

~D[2( ■_>/»,  +  I)  -  c] 

+  "[';( 7 )  +  *;(r  +  1  )  +  •! 

-  2(/>  -  1 ) 

('(/;./;  +  r) 

•>.r  +  >‘(  \  -  hln.i) 

\L,rTr‘ 

-2pr  +  ',_l[2///>,  +  u] 

- 2 o1'  [«i(r  -1)4  lj 

+  or[(o  +  2)<il>,{;(h  -  1  )-(//-  1  )nh-')  -  2| 

+  '}(/')["‘D(7)  +  >',/] 

+  U ['  (  T  )  +  ~i  (  T  +  ll  )  +')(/()]  +2 

form a separate group. NEAR(1) and PAR(1) form an additional class. Note that the cumulant surface produced by NEAR(1) is a combination of EAR(1) and TEAR(1) and that it looks very much like the surface produced by PAR(1). However, the two models seem to differ in their behavior when one observes the plots of the theoretical ratios. A closer look at the vertical axis for NEAR(1) and PAR(1) in figures 5 and 6 shows that the ranges are similar and much smaller than the ranges of the vertical axis for the other models.


6  Methodology 


In the following we propose a discrimination procedure that may be applied to the models under investigation (3.25)-(3.31) or to any set of competing models.

Let

$$ M = \{ \text{a finite set of finite-parameter models} \}. $$




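Table 6: Ratios of the 3rd-Order Cumulant Structure

Ratio                           EAR(1)             TEAR(1)                                                    Robertson's Fixed Model

$C(\tau,\tau)/C(0,\tau)$        $\rho^{\tau}$      $1 + (1-\alpha)\tau$                                       $1 - \tau\ln\beta$

$C(1,1+\tau)/C(0,\tau)$         $\rho^{2}$         $\alpha(2-\alpha)$                                         $\beta(1 - \ln\beta)$

$C(h,h+\tau)/C(0,\tau)$         $\rho^{2h}$        $\alpha^{h}[1 + h(1-\alpha)]$                              $\beta^{h}(1 - h\ln\beta)$

$C(1,1+\tau)/C(\tau,\tau)$      $\rho^{2-\tau}$    $\alpha(2-\alpha)/[1 + \tau(1-\alpha)]$                    $\beta(1 - \ln\beta)/(1 - \tau\ln\beta)$

$C(h,h+\tau)/C(\tau,\tau)$      $\rho^{2h-\tau}$   $\alpha^{h}[1 + h(1-\alpha)]/[1 + \tau(1-\alpha)]$         $\beta^{h}(1 - h\ln\beta)/(1 - \tau\ln\beta)$

$C(h,h+\tau)/C(1,1+\tau)$       $\rho^{2h-2}$      $\alpha^{h-1}[1 + h(1-\alpha)]/(2-\alpha)$                 $\beta^{h-1}(1 - h\ln\beta)/(1 - \ln\beta)$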

Our objective is to identify the model $m \in M$ most compatible with $\{X_t\}$. Specifically, given $\{X_t\}_{t=1}^{n}$, find a model $m \in M$ such that $m \sim \{X_t\}_{t=1}^{n}$.

Procedure:

1. Compute $\hat{C}(u_1, \ldots, u_k)$, $k = 0, 1, 2, \ldots$, $u_i \in I$ integer. We call it the empirical $k$th-order cumulant structure based on the data $\{X_t\}$.

2. For each $m \in M$:

(a) Estimate, using $\{X_t\}_{t=1}^{n}$, the parameter $\theta_m$ (possibly a vector) for model $m$.

(b) Compute $C_{\theta_m}(u_1, \ldots, u_k)$ for model $m$ empirically or using the theoretical cumulant structure. We shall call it Method 1 if the computation of the cumulant structure is done using the known theoretical cumulant structure. We shall call it Method 2 if the computation of the cumulant structure is done empirically, based on realizations simulated from model $m$.

3. Given the above quantities we seek to minimize, for a norm $\|\cdot\|$,

$$ \min_{m \in M} \left\| \hat{C}(u_1, \ldots, u_k) - C_{\theta_m}(u_1, \ldots, u_k) \right\|. \tag{6.1} $$

Alternatively,

$$ \min_{m \in M} \left\| \hat{f}(\lambda_1, \ldots, \lambda_k) - f_{\theta_m}(\lambda_1, \ldots, \lambda_k) \right\| \tag{6.2} $$




where $f_{\theta_m}(\lambda_1, \ldots, \lambda_k)$ is the $k$th-order spectrum (i.e. polyspectrum). The general distance measure may be specified as e.g.

II  9  11=  E  I  9  I"'- 
{«.}€>' 

There are several issues that need to be considered under the proposed procedure. First, various properties of the model such as stationarity, ergodicity, moment conditions, moment calculations, parameter estimation and simulation aspects of sample traces must be investigated. Second, statistical properties of the formal test statistics based on (6.1) or (6.2) have to be studied. In order to do so the sampling properties of the proposed procedure must be investigated. In the following section we consider the simulation aspects of (6.1) and present some simulation results for both methods 1 and 2.
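The procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' FORTRAN implementation: the parameters are taken as already estimated (step 2a is skipped), the norm is a sum of absolute differences raised to a power `p` (our choice), and the two candidate structures are the closed forms for EAR(1) and TEAR(1) as read from the cumulant-structure table of section 5, $C(h, h+\tau) = 2\rho^{\tau+2h}$ and $C(h, h+\tau) = 2\alpha^{\tau+h}[1 + h(1-\alpha)]$. The function names are hypothetical.

```python
def ear1_cumulant(rho):
    """Theoretical 3rd-order cumulant C(u, v) of EAR(1), as read from the table."""
    return lambda u, v: 2.0 * rho ** (u + v)


def tear1_cumulant(alpha):
    """Theoretical 3rd-order cumulant C(u, v) of TEAR(1), as read from the table."""
    def c(u, v):
        u, v = min(u, v), max(u, v)  # the cumulant is symmetric in its two lags
        return 2.0 * alpha ** v * (1.0 + u * (1.0 - alpha))
    return c


def distance(empirical, model_cumulant, p=2):
    """Distance (6.1) over the lag set, with sum |.|^p as an assumed norm."""
    return sum(abs(c_hat - model_cumulant(u, v)) ** p
               for (u, v), c_hat in empirical.items())


def discriminate(empirical, candidates, p=2):
    """Return the candidate minimizing the distance, plus all distances."""
    dist = {name: distance(empirical, c, p) for name, c in candidates.items()}
    best = min(dist, key=dist.get)
    return best, dist
```

Here `empirical` maps lag pairs `(u, v)` to estimated cumulants, so the same function serves Method 1 (theoretical candidates, as above) and Method 2 (candidates built from simulated traces).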

7  Simulation  Results 

In order to verify the possibility of discriminating among the various models on the basis of their respective 3rd-order cumulant surfaces, it is necessary to obtain reasonable agreement between the theoretical and simulated cumulants. In the following we discuss issues related to the simulation aspects of the sample traces, correlation functions, 3rd-order cumulant surfaces and ratios.

7.1  Simulating  Sample  Traces 

The simulation aspects of the NEAR(1) model and its special cases, EAR(1) and TEAR(1), were considered by Lawrance and Lewis [20]. The algorithm they give is used in our simulation to generate sample realizations for the NEAR(1) family. The subcases, EAR(1) and TEAR(1), are simulated by setting ($\alpha = 0.99$, $0 < \beta < 1$) and ($\beta = 0.99$, $0 < \alpha \le 1$), respectively, in the same program that generates the simulated paths for the NEAR(1) model. We follow Lawrance and Lewis in setting the degenerate parameters to 0.99 so as to avoid complications in the simulation of the traces.

Robertson's fixed and random models are generated by two different programs: one which allows the selection of a branch with a fixed probability, and one which allows the selection of a branch with a random probability generated according to a beta random variable with parameters ($\alpha$, 2). The input signal is a truncated exponential; hence, it needs to be simulated accordingly. Since no IMSL subroutine is available we generate a realization from a truncated exponential random variable using the cumulative distribution function technique. Realizations from the AR(1) model are easily simulated and no further explanations are required. McKenzie [27] discusses the simulation of PAR(1) models. The innovation process $V_t$ is generated according to

$$ V = E^{1-\alpha}\, b(U) $$

where $U$ is distributed as a uniform $(0, \pi)$ sequence of random variables which is independent of $E$, a sequence of exponential mean one random variables. The function $b$ is defined by

$$ b(\theta) = \sin\theta \,(\sin\alpha\theta)^{-\alpha}\, \{\sin (1-\alpha)\theta\}^{-(1-\alpha)}. $$

Thus, $\{V_t\}$ is generated as a mixture of uniform and exponential sequences of independent random variables.

All the simulated paths are generated by FORTRAN programs that call IMSL subroutines which are used to simulate continuous uniform, beta and exponential realizations.
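Two of the generators described in this subsection can be sketched in Python with NumPy standing in for the IMSL subroutines. This is a hedged illustration only. The truncation interval `[lower, upper]` is a hypothetical parameterization, since the truncation point of Robertson's input is defined earlier in the paper and not repeated here, and `par1_innovations` assumes the innovation has the form $V = E^{1-\alpha} b(U)$ with $b(\theta) = \sin\theta(\sin\alpha\theta)^{-\alpha}\{\sin(1-\alpha)\theta\}^{-(1-\alpha)}$ and $0 < \alpha < 1$.

```python
import numpy as np


def truncated_exponential(u, lower, upper):
    """Inverse-CDF draw from a unit-rate exponential restricted to [lower, upper].

    u is uniform on [0, 1); solving F(x) = u for the truncated CDF
    F(x) = (e^{-lower} - e^{-x}) / (e^{-lower} - e^{-upper}) gives x.
    """
    u = np.asarray(u, dtype=float)
    e_lo, e_hi = np.exp(-lower), np.exp(-upper)
    return -np.log(e_lo - u * (e_lo - e_hi))


def b(theta, alpha):
    """The function b(theta) used in the PAR(1) innovation."""
    return (np.sin(theta)
            * np.sin(alpha * theta) ** (-alpha)
            * np.sin((1.0 - alpha) * theta) ** (-(1.0 - alpha)))


def par1_innovations(alpha, size, rng):
    """Draw V = E^(1 - alpha) * b(U), U ~ uniform(0, pi), E ~ exponential(1)."""
    u = rng.uniform(0.0, np.pi, size)
    e = rng.exponential(1.0, size)
    return e ** (1.0 - alpha) * b(u, alpha)
```

A typical call would be `par1_innovations(0.5, 1010, np.random.default_rng(0))` to obtain the innovations for one trace of the length used in the study.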

7.2  Simulating  Higher  Order  Moments 

One FORTRAN program is employed in simulating the correlation functions, 3rd-order cumulant surfaces and certain slices of these surfaces. Smoothing considerations lead us to simulate each model 30 times, where the length of each simulated trace is 1010 data points.
Table 7: Distance Measure (6.1): $\rho(s) = (0.25)^s$

                     PAR(1)   EAR(1)   TEAR(1)   NEAR(1)   Robertson's Fixed   Robertson's Random
PAR(1)               0.20     0.21     0.17      0.00      0.58                0.40
EAR(1)               0.018    0.015    0.08      0.08      1.10                l.i-i
TEAR(1)              0.78     0.70     0.018     0.37      0.08                0.03
NEAR(1)              0.12     0.09     o.:n      0.008     0.8-1               0.03
Robertson's Fixed    1.80     i.or,    0.22      1.00      0.02                0.U7
Robertson's Random   0.73     0.00     0.02      0.31      0.12                0.05

The program computes two expectation terms: $E[X_t X_{t+\tau}]$, over the range of lags 0 to 9, and $E[X_t X_{t+\tau} X_{t+\tau+s}]$, over the range of lags -9 to 9. Then the smoothed empirical correlation function and the smoothed 3rd-order cumulant surface are computed using their definitions. In the computations of the expectation terms we use:

$$ E[X_t X_{t+\tau}] = \frac{1}{1010} \sum_{t=1}^{1001} X_t X_{t+\tau} $$

$$ E[X_t X_{t+\tau} X_{t+\tau+s}] = \frac{1}{1010} \sum_{t=1}^{1001} X_t X_{t+\tau} X_{t+\tau+s} $$

In order to determine how accurately the simulated cumulant surfaces match their theoretical counterparts we plot the empirical correlation functions, the empirical $C(\tau, \tau)$ slice and the complete simulated surfaces in figures 7-9. This is done for various parameter values and shown for those that correspond to $\rho(s) = (0.5)^s$.
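The estimation step can be sketched as follows. This is an illustrative Python version, not the authors' program. It assumes nonnegative lags only, averages over the terms actually available instead of using the fixed 1/1010 normalization, and takes the 3rd-order cumulant to be $\mathrm{cum}(X_t, X_{t+\tau}, X_{t+\tau+s})$ written in terms of raw moments. "Smoothing" is interpreted here as averaging the estimate over the replicated traces. All function names are hypothetical.

```python
import numpy as np


def raw_moment2(x, r):
    """Estimate E[X_t X_{t+r}] for a lag r >= 0."""
    n = len(x) - r
    return float(np.mean(x[:n] * x[r:r + n]))


def raw_moment3(x, r, s):
    """Estimate E[X_t X_{t+r} X_{t+r+s}] for lags r, s >= 0."""
    n = len(x) - r - s
    return float(np.mean(x[:n] * x[r:r + n] * x[r + s:r + s + n]))


def third_order_cumulant(x, r, s):
    """Estimate cum(X_t, X_{t+r}, X_{t+r+s}) from raw moments of a stationary trace."""
    x = np.asarray(x, dtype=float)
    m1 = float(x.mean())
    pair_terms = raw_moment2(x, r) + raw_moment2(x, s) + raw_moment2(x, r + s)
    return raw_moment3(x, r, s) - m1 * pair_terms + 2.0 * m1 ** 3


def smoothed_cumulant(traces, r, s):
    """Average the cumulant estimate over replicated traces (e.g. 30 of length 1010)."""
    return float(np.mean([third_order_cumulant(x, r, s) for x in traces]))
```

For an i.i.d. unit exponential input the estimate at lags (0, 0) should be close to 2, the third cumulant of the unit exponential, and close to 0 at any nonzero lag.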

7.3  Discrimination  Procedure  :  Method  1 

The results of the simulation study are summarized in tables 7-12. Tables 7-9 are examples of typical values obtained by a single run of the simulation. Tables 10-12 provide the proportions of correct model identification out of 30 repetitions. Note that in table 7 the diagonal line contains the minimum values of rows 2-5. This is precisely how we would expect the procedure to perform for any parameter value indexing a standardized correlation function. However, errors occur at the first and last rows where the method fails to select the correct




Table 8: Distance Measure (6.1): $\rho(s) = (0.5)^s$

                     PAR(1)   EAR(1)   TEAR(1)   NEAR(1)   Robertson's Fixed   Robertson's Random
PAR(1)               0.07     0.11     0.71      0.02      1.11                1.13
EAR(1)               0.07     0.01     2.21      0.30      3.31                3.12
TEAR(1)              3.59     2.07     o.osr*    1.30      0.0S7               0.10
NEAR(1)              0.0-1    0.37     0.03      o.oos     1.07                1.00
Robertson's Fixed    1.02     1.27     0.32      2.30      0.00                0.13
Robertson's Random   2.30     l.Ol     0.07      0.70      0.20                0.32

Table 9: Distance Measure (6.1): $\rho(s) = (0.75)^s$

                     PAR(1)   EAR(1)   TEAR(1)   NEAR(1)   Robertson's Fixed   Robertson's Random
PAR(1)               1.91     1.12     2.28      0.03      3.07                3.02
EAR(1)               1.82     0.02     0.33      0.85      7.70                7.00
TEAR(1)              1.11     3.8ft    0.-10     1.25      0.79                0.93
NEAR(1)              1.S3     0.63     3.26      0.09      1.19                1.15
Robertson's Fixed    1.82     1.59     0.31      1.63      0.57                0.70
Robertson's Random   9.13     10.38    0.51      5.32      0.21                0.30

model. The PAR(1) model is identified as a NEAR(1) model and Robertson's random model is identified as a TEAR(1) model. The theoretical plots of the 3rd-order cumulant structure support this confusion as they show that these models produce very similar surfaces that are hard to distinguish. In table 8 we note that the procedure fails again to select PAR(1) and Robertson's random models. Errors occur at the first and last two rows of table 9 where the procedure fails to distinguish PAR(1), the fixed and random models. The incorrect selection that appears in the above tables is consistent with our previous remark regarding the grouping of the models into three categories. Robertson's models and TEAR(1) were identified as sharing a very similar 3rd-order cumulant structure and so were PAR(1) and NEAR(1). Thus, one would expect to have difficulties in discriminating among models that belong to the same family. The pattern established in the previous tables is consistent in the 30 repetitions we consider in tables 10-12. PAR(1) is consistently confused with NEAR(1), and TEAR(1) and Robertson's models stand out as a separate class. The random model is by and large the hardest to identify and typically is mistaken for the TEAR(1) model. Although the procedure is successful in identifying TEAR(1) and the fixed model it
PAR(1)


TEAR(l) 


NEAR ( 1 ) 


Robertson's  Fixed 


Robertson's  Random 


PAR(l)  (  EAR(l)  I  TEAR(l)  !  NEAR(l)  |  Robertson's  Fixed  |  Robertson's  Random 


0.0  0.0  |  0.0  |  1.0 


0.0  0.0 


Table 13: Proportions of Correct Identification: $\rho(s) = (0.25)^s$


TEAR(l) 

Robertson’s  Fixed 

Robertson’s  Random 

TEAR(l) 

0.97 

0.03 

0. 1 

Robertson’s  Fixed 

0.13 

0.N7 

0.0 

Robertson’s  Random 

0. 1  7 

0.1 

0.73 

Table 14: Proportions of Correct Identification: $\rho(s) = (0.5)^s$

TEAR(l) 

Robertson’s  Fixed 

Robertson’s  Random 

TEAR(l) 

0.7 

0.17 

0.13 

Robertson’s  Fixed 

0.37 

0.  Iti 

0.37 

Robertson’s  Random 

0.3 

0.3 

0.3 

is the confusion in selecting the random model that makes it difficult to judge the adequacy of TEAR(1) or the fixed model. However, since the three models share very similar traces and 3rd-order cumulant structure one may choose to accept each of the three as compatible with any of that group.

To remedy this problem we may apply the proposed discrimination procedure to the 4th-order cumulant structure for those three models. One may argue that since the models share an identical 2nd-order moment structure and a similar 3rd-order cumulant structure (so similar that their differences cannot be captured by (6.1)), it might be possible to reveal their true identity through the use of the 4th-order cumulant structure. Tables 13-15 contain the results of the simulation study applied to the 4th-order cumulant structure of TEAR(1) and Robertson's models. The choice among the models is not clear cut as the proportions


Table 15: Proportions of Correct Identification: $\rho(s) = (0.75)^s$


TEAR(l) 

Robertson’s  Fixed 

Robertson’s  Random 

TEAR  (1 ) 

0.03 

0.3 

0.07 

Robertson’s  Fixed 

0.13 

0.37 

0.0  1 

Robertson’s  Random 

(1.  1 7 

. 0.  17 

turn  j 
Table 16: Proportions of Correct Identification: $\rho(s) = (0.25)^s$

                ARE(1)   PAR(1)   EAR(1)   TEAR(1)   NEAR(1)   Rob's Fixed   Rob's Random
ARE(1)          1.0      0.0      0.0      0.0       0.0       0.0           0.0
PAR(1)          0.0      0.73     0.0      0.0       0.27      0.0           0.0
EAR(1)          0.0      0.0      0.0.3    0.0       0.07      0.0           0.0
TEAR(1)         0.0      0.0      0.0      0.0.3     0.0       0.03          0.3.3
NEAR(1)         0.0      0.3      0.03     0.0       0.07      0.0           0.0
Rob's Fixed     0.0      o.u      0.0      0.07      0.0       0.07          0.20
Rob's Random    0.0      0.0      0.0      0.3       0.0       0.23          0.47

of correct identification are not large enough to enable a reasonable degree of discrimination power among the three competing models. This result was expected to hold given the theoretical expressions as expressed through the plots for the theoretical 4th-order cumulant structure, figure 10. In these plots the models are shown to produce similar behavior at various frames of $C(\tau, s, u)$; thus, there is no reason to expect a high degree of discrimination power among the models on the basis of the proposed procedure and the 4th-order cumulant structure.


7.4  Discrimination  Procedure  :  Method  2 

In tables 16-18 we provide the results of our simulation study according to (6.1) based on the empirical cumulant structure only. Note that we added ARE(1) for comparison purposes. Since the marginal moments of ARE(1) are different from those of the remaining models we standardize its mean to equal one, so the mean of the exponential innovation process becomes $1 - \alpha$. The higher order moments are not standardized to equal those of the exponential models. The results in tables 16-18 are by and large consistent with the results obtained under the previous method. The main difference appears to be in the improved separation between PAR(1) and NEAR(1) under the second method, while under the first method, which involved the theoretical cumulant structure, PAR(1) is consistently mistaken for NEAR(1). We use method 2 with the 4th-order empirical cumulant structure for TEAR(1) and Robertson's models. The results are summarized in tables 19-21. Figure 11 contains the plots of
Table 17: Proportions of Correct Identification: $\rho(s) = (0.5)^s$


ARE(l) 

rpAR(i) 

ear;d 

TEA11II) 

NEAR!  11  '  H-. 

li’r-  Fixi  cl 

Roll 

*s  Randm 

o  1 

AREin 

1.0 

0.0 

O  ) 

0  1) 

o  II  ] 

0  0 

0.0 

j 

PAR  (11 

0.0 

o.sr 

o.o 

■’ 1  '■  •  i  - 

Oil 

_J 

0.0 

EAR  (1 ) 

0.0 

0.0 

_  >■'». 

0.0 

0  (l 

d 

0  0 

1 

TEAR(l) 

0.0 

0.0 

C"._ 

OF1 

«.  o 

o  j7 

J  l.'i 

NEAR(l) 

0.0 

0.0 

Oil 

<  J  .  '►  ( 

0.0 

no 

Rob's  Fixed 

0.0 

0  0 

0.0  !  0.07 

0.0 

n  f ;  :> 

O'. 

Rob's  Random 

0.0  |  (Ml 

0.0  \  0.  1 

0.  1 

o 

Table 18: Proportions of Correct Identification: $\rho(s) = (0.75)^s$


the simulated 4th-order cumulant structure for the three models. The results confirm our previous comment regarding the difficulties encountered by the discrimination procedure in distinguishing among these three models.


8  Conclusions 


The problem of discrimination among non-linear time series models is considered in this paper through the family of exponential models. In this specific case we are able to develop the theoretical 3rd-order cumulant structure and confirm it by simulation. The procedure we


Table 19: Proportions of Correct Identification: $\rho(s) = (0.25)^s$


TEAR(l) 

Robertson’s  Fixed 

Robertson’s  Random 

TEAR(l) 

o.  in 

n.-jr 

Odd 

Robertson’s  Fixed 

Odd 

Odd 

oTdd 

Robertson’s  Random 

0.10 

o.:t:t 

Od  7 
Table 20: Proportions of Correct Identification: $\rho(s) = (0.5)^s$


rtEAR(l) 

TEAR(l) 

0.27 

Robertson’s  Fixed 

0.2.5 

Robertson’s  Random  j  (1.27 

Robertson’s  Fixed 

Robertson’s  Random 

(1.20 

O.a.'f 

II.  ID 

o.:57 

(I..5 

o.  id 

Table 21: Proportions of Correct Identification: $\rho(s) = (0.75)^s$


j 

TEAR(l) 

Robertson’s  Fixed 

Robertson’s  Random 

j  TFAR(l) 

n.iin 

0.dll 

0.10 

!  Robertson’s  Fixed 

0.27 

o.:{:5 

1).  10 

|  Robertson’s  Random 

0.27 

.  11:111  ..  .  2 

0.1:5 

propose is not restricted to the class of AR(1) type models or the class of models for which analytical results for the 3rd-order cumulant structure are available. It is a general procedure with the potential for a wide range of non-linear models. It is based on the understanding that different models cannot have an identical moment sequence; hence, the discrimination among them would become possible at some stage in the higher order cumulant structure. In our specific case we are able to obtain a significant improvement in our discriminatory power just by going one step above the traditional second order moment analysis, i.e. the correlation function. While second order moments play a dominating role in linear model discrimination they are very limited in the non-linear case. When the 2nd-order analysis fails to provide enough information we propose to apply higher order moment analysis for the purpose of model discrimination.
Summary

In  this  article  we  discuss  ways  in  which  moments  are  used  (a)  to  approximate 
distributions,  and  (b)  to  test  fit  to  a  given  distribution. 

1  Approximating  distributions  using  moments 

Solomon and Stephens (1977) give a number of examples of statistics $X$ for which the first few, or even all, the moments or cumulants may be found, but whose density $f(x)$ and distribution $F(x)$, assumed continuous, are intractable. A good example is the statistic $S$ whose distribution is the weighted sum of independent chi-square variables, each with one degree of freedom, written

$$ S = \sum_{i=1}^{k} \lambda_i (u_i)^2 \tag{1} $$

where $u_i$ are i.i.d. $N(0, 1)$, and $\lambda_i$ are known weights. Many quantities in statistics have distributions (often asymptotic distributions) like $S$; for example, the Pearson $X^2$ statistic, used in testing fit to a distribution when the distribution tested contains unknown parameters which are estimated by maximising the usual likelihood, rather than the multinomial likelihood, has this distribution with some $\lambda_i \neq 1$. Other goodness-of-fit statistics, of Cramér-von Mises type, based on the empirical distribution function (EDF), also have such asymptotic distributions (see, for example, many examples in Stephens, 1986a).

One of the first examples of $S$ to be tabulated, for $k = 2$, involved errors in target hitting during World War 2: tables for $S$ were produced with some labour by Grad and Solomon (1955) using analytic methods. These have been extended by various authors to higher values of $k$, but the analysis after $k = 5$ or 6 rapidly




becomes very difficult. Thus in general it is difficult to find exact percentage points of $S$, but the cumulants $\kappa_r$, $r = 1, 2, \ldots$, are very easily obtained:

$$ \kappa_r = \sum_{i=1}^{k} \lambda_i^r \, 2^{r-1} (r-1)! \tag{2} $$

2  Moments  and  cumulants 

In this section we list definitions. The $r$-th moment about the origin of a random variable $X$, or equivalently of its distribution $f(x)$, will be called $\mu'_r$; the $r$-th moment about the mean will be $\mu_r$. The moment generating function $M_X(t)$ of $X$ is defined by

$$ M_X(t) = \int_{-\infty}^{\infty} e^{tx} f(x)\, dx; \tag{3} $$

when expanded as a Taylor series,

$$ M_X(t) = 1 + \mu t + \frac{\mu'_2 t^2}{2!} + \frac{\mu'_3 t^3}{3!} + \cdots + \frac{\mu'_r t^r}{r!} + \cdots \tag{4} $$

where $\mu = \mu'_1$ is the mean of $X$.

Cumulants $\kappa_r$ are defined through the cumulant generating function $C_X(t) = \log M_X(t)$, where "log" refers to natural logarithm. Then

$$ C_X(t) = \kappa_1 t + \frac{\kappa_2 t^2}{2!} + \frac{\kappa_3 t^3}{3!} + \cdots + \frac{\kappa_r t^r}{r!} + \cdots \tag{5} $$

Thus in principle we must find $M_X(t)$ before finding $C_X(t)$.

The following relationships exist between low-order moments and cumulants: $\kappa_1 = \mu'_1 = \mu$; $\kappa_2 = \mu_2 = \sigma^2$; $\kappa_3 = \mu_3$; $\kappa_4 = \mu_4 - 3\mu_2^2$. Further relationships may be found in Kendall and Stuart (1977, vol. 1).
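The low-order relationships above are easy to check numerically. The following Python sketch (function names are ours) first converts moments about the origin to moments about the mean using the standard binomial expansions, and then applies the relations just listed.

```python
def central_moments_from_raw(m1, m2, m3, m4):
    """Convert raw moments mu'_1..mu'_4 into central moments (mu_2, mu_3, mu_4)."""
    mu2 = m2 - m1 ** 2
    mu3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3
    mu4 = m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4
    return mu2, mu3, mu4


def cumulants_from_central(mean, mu2, mu3, mu4):
    """Return (kappa_1, kappa_2, kappa_3, kappa_4) from the mean and central moments."""
    return mean, mu2, mu3, mu4 - 3 * mu2 ** 2
```

For the unit exponential distribution, whose raw moments are $\mu'_r = r!$, this gives central moments (1, 2, 9) and cumulants (1, 1, 2, 6), in agreement with $\kappa_r = (r-1)!$.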

Suppose $Z = X_1 + X_2 + X_3 + \cdots + X_k$ where $X_i$ are independent random variables. Then a property of moment generating functions is

$$ M_Z(t) = M_{X_1}(t)\, M_{X_2}(t)\, M_{X_3}(t) \cdots M_{X_k}(t), $$

so that

$$ C_Z(t) = C_{X_1}(t) + C_{X_2}(t) + \cdots + C_{X_k}(t), \tag{6} $$

and it quickly follows, using obvious notation, that

$$ \kappa_r(Z) = \kappa_r(X_1) + \kappa_r(X_2) + \cdots + \kappa_r(X_k). \tag{7} $$


This additive property makes it very easy to find cumulants of sums of independent random variables, and hence, for example, the cumulants of $S$.
Two important $M_X(t)$ are those of the $N(\mu, \sigma^2)$ distribution, $M_X(t) = \exp(\mu t + \sigma^2 t^2/2)$, and the $\chi^2_p$ distribution, $M_X(t) = 1/(1 - 2t)^{p/2}$. Finally, it is easily shown that $\mu_r(aX + b) = a^r \mu_r(X)$, for $r \ge 2$, where $a$ and $b$ are any real constants, and $\kappa_r(aX + b) = a^r \kappa_r(X)$, $r \ge 2$.

As an example, consider $S$. If $X$ has a $\chi^2_1$ distribution, the MGF of $X$ is $1/(1 - 2t)^{1/2}$; thus $C_X(t) = -\tfrac{1}{2}\log(1 - 2t)$, and expansion gives $C_X(t) = t + 2t^2/2! + 8t^3/3! + \cdots$. Thus the $r$-th cumulant of $X$ is $\kappa_r = 2^{r-1}(r - 1)!$, that of $\lambda_i X$ is $\lambda_i^r \kappa_r$, and by the additive property (7), the $r$-th cumulant of $S$ is given by the expression (2).
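Expression (2) translates directly into code. The short Python function below is a sketch (the name `cumulant_of_S` is ours) that evaluates $\kappa_r = 2^{r-1}(r-1)! \sum_i \lambda_i^r$ for a given list of weights.

```python
from math import factorial


def cumulant_of_S(weights, r):
    """r-th cumulant of S = sum_i lambda_i * u_i^2 with u_i i.i.d. N(0, 1)."""
    if r < 1:
        raise ValueError("r must be a positive integer")
    return 2 ** (r - 1) * factorial(r - 1) * sum(w ** r for w in weights)
```

With a single weight equal to 1 this reproduces the cumulants 1, 2, 8, 48 of the chi-square distribution with one degree of freedom; with three unit weights the first two cumulants are 3 and 6, the mean and variance of a chi-square with three degrees of freedom.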

3  Mathematical  approximations 

The approximations in this section are called "mathematical" because they are based on mathematical analysis, with known properties of accuracy and convergence, in contrast to those to be considered later.

Suppose $n(x)$ is the standard normal density

$$ n(x) = e^{-x^2/2} / \sqrt{2\pi} \tag{8} $$

and let $f(x)$ be the (continuous) density of $X$. Then it is (nearly always) possible to expand $f(x)$ as

$$ f(x) = n(x) \left\{ 1 + \tfrac{1}{2}(\mu_2 - 1) H_2(x) + \tfrac{1}{6}\mu_3 H_3(x) + \tfrac{1}{24}(\mu_4 - 6\mu_2 + 3) H_4(x) + \cdots \right\} \tag{9} $$

called  a  Gram-Charlier  series.  The  Hr(x)  are  Hermite  polynomials.  Lists  of 
Hermite  polynomials,  and  also  conditions  for  convergence,  etc.,  are  given  in 
Kendall  and  Stuart  (1977,  vol.  1). 

The basic technique involved in deriving (9) rests on the fact that Hermite polynomials are orthogonal with respect to the kernel $n(x)$; thus

$$ \int_{-\infty}^{\infty} H_i(x)\, H_j(x)\, n(x)\, dx = \begin{cases} 0, & i \neq j \\ j!, & i = j \end{cases} \tag{10} $$

Then if $f(x) = \sum_{j=0}^{\infty} c_j\, n(x) H_j(x)$, multiplication by $H_j(x)$ on both sides, and integration, gives

$$ c_j = \int_{-\infty}^{\infty} f(x)\, H_j(x)\, dx \,/\, j!. $$

When worked out, $c_2 = (\mu_2 - 1)/2$, $c_3 = \mu_3/6$, etc.

If an infinite set of moments is available, as for $S$, the density can be approximated very accurately using a Gram-Charlier series of sufficient length, but there are many statistics in practical applications for which it is difficult even to get the first four moments; see Solomon and Stephens (1977) for examples. There are two other important drawbacks:
1. A $k$-term fit might, at any one value of $x$, be worse than a $(k - 1)$-term fit.

2. Gram-Charlier series with finite numbers of moments can give a negative density $f(x)$, particularly in the tails.
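A four-moment Gram-Charlier approximation, and the second drawback above, can be illustrated with a short Python sketch. It assumes $X$ has mean zero, so that $\mu_2, \mu_3, \mu_4$ are moments about the mean and the series has no $H_1$ term, and it uses NumPy's `hermeval`, which evaluates series in the probabilists' Hermite polynomials. The function name is ours.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval


def gram_charlier_pdf(x, mu2, mu3, mu4):
    """Gram-Charlier density through H_4 for a mean-zero variable."""
    x = np.asarray(x, dtype=float)
    # Coefficients c_0..c_4 of H_0..H_4; c_1 = 0 because the mean is zero.
    coeffs = [1.0, 0.0,
              (mu2 - 1.0) / 2.0,
              mu3 / 6.0,
              (mu4 - 6.0 * mu2 + 3.0) / 24.0]
    normal = np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)
    return normal * hermeval(x, coeffs)
```

With the normal moments (1, 0, 3) every correction vanishes and the standard normal density is returned. With the moments (1, 2, 9) of a centred unit exponential, the approximation is negative at $x = -2$, which shows how a truncated series can fail to be a density.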

3.1  Percentage  points  approximation 

A Gram-Charlier-type expansion can also be found for $F(x)$, the distribution function of $X$; this can be inverted to give a percentage point for a given cumulative area $\alpha$. Thus suppose $F(x_\alpha) = \alpha$; we want an approximation to $x_\alpha$. A Cornish-Fisher expansion gives $x_\alpha - \xi$ as a series in Hermite polynomials in $x$, or (more practically useful) in $\xi$, where $\xi$ is the percentile corresponding to $\alpha$ for the normal distribution, that is, $\xi$ is the solution of

$$ \int_{-\infty}^{\xi} n(x)\, dx = \alpha. \tag{11} $$

Again, problems can arise with the convergence to the desired $x_\alpha$. For more details on mathematical expansions of Gram-Charlier or Cornish-Fisher type, see Kendall and Stuart (1977, vol. 1).
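As an illustration, the leading terms of a Cornish-Fisher expansion in $\xi$ can be coded as below. The text does not list the terms, so this sketch assumes the commonly quoted truncation after the fourth cumulant, written in terms of the standardised cumulants $\gamma_1 = \kappa_3/\kappa_2^{3/2}$ and $\gamma_2 = \kappa_4/\kappa_2^2$; the function name is ours.

```python
from statistics import NormalDist


def cornish_fisher_point(alpha, mean, kappa2, kappa3, kappa4):
    """Approximate x_alpha from the first four cumulants (truncated expansion)."""
    xi = NormalDist().inv_cdf(alpha)      # normal percentile solving (11)
    g1 = kappa3 / kappa2 ** 1.5           # standardised third cumulant
    g2 = kappa4 / kappa2 ** 2             # standardised fourth cumulant
    z = (xi
         + g1 * (xi ** 2 - 1.0) / 6.0
         + g2 * (xi ** 3 - 3.0 * xi) / 24.0
         - g1 ** 2 * (2.0 * xi ** 3 - 5.0 * xi) / 36.0)
    return mean + kappa2 ** 0.5 * z
```

For a chi-square variable with 10 degrees of freedom (cumulants 10, 20, 80, 480) the approximate upper 5% point is about 18.32, against the tabulated value 18.307.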


4  Pearson  curves  and  other  systems 


We  now  turn  to  a  method  of  approximation  which  can  be  thought  of  as  “laying 
one  curve  upon  another”  —  the  approximating  curve  has  parameters  which  can 
be  varied  to  make  a  good  fit.  The  parameters  are  usually  chosen  by  matching 
moments  or  cumulants.  Percentage  points  of  the  approximating  curve,  which 
are  tabulated  or  otherwise  easily  found,  are  then  used  as  approximations  to  the 
desired  points. 

A family of approximating curves is the Pearson system, where the (continuous) density $f(x)$ is approximated by $f^*(x)$, given by

$$ \frac{1}{f^*(x)} \frac{d f^*(x)}{dx} = \frac{a + x}{b_0 + b_1 x + b_2 x^2} \tag{12} $$

According to the values of the constants $a, b_0, b_1, b_2$, integration of the right-hand side will take many forms, giving great flexibility to the system of densities $f^*(x)$. With considerable algebra (see Elderton and Johnson, 1969, for details), the constants may be put in terms of the moments:

Suppose

$$ A = 10\mu_4\mu_2 - 18\mu_2^3 - 12\mu_3^2; \tag{13} $$

then

$$ a = \frac{\mu_3(\mu_4 + 3\mu_2^2)}{A} \tag{14} $$

$$ b_0 = \frac{-\mu_2(4\mu_2\mu_4 - 3\mu_3^2)}{A} \tag{15} $$

$$ b_1 = -a; \tag{16} $$

$$ b_2 = \frac{-(2\mu_2\mu_4 - 3\mu_3^2 - 6\mu_2^3)}{A} \tag{17} $$

Thus knowledge of the first four moments or cumulants of $X$ will fix the
constants above; a further constant $C$ enters on integrating, but is fixed by
the fact that the total integral of $f^*(x)$ must be 1.
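
The mapping from moments to constants is easy to mechanise. The sketch below assumes the sign convention in which the right-hand side of the density equation is $(a + x)/(b_0 + b_1 x + b_2 x^2)$, with $x$ measured from the mean so that $\mu_2, \mu_3, \mu_4$ are central moments; the function name is hypothetical.

```python
def pearson_constants(mu2, mu3, mu4):
    """Return (a, b0, b1, b2) from the central moments mu2, mu3, mu4.

    Assumed convention: (1/f) df/dx = (a + x) / (b0 + b1*x + b2*x**2),
    with x measured from the mean.
    """
    A = 10 * mu4 * mu2 - 18 * mu2**3 - 12 * mu3**2
    if A == 0:
        raise ValueError("A = 0: the constants are undefined for these moments")
    a = mu3 * (mu4 + 3 * mu2**2) / A
    b0 = -mu2 * (4 * mu2 * mu4 - 3 * mu3**2) / A
    b1 = -a
    b2 = -(2 * mu2 * mu4 - 3 * mu3**2 - 6 * mu2**3) / A
    return a, b0, b1, b2
```

For the standard normal moments (1, 0, 3) this gives $a = b_1 = b_2 = 0$ and $b_0 = -1$, so the density equation reduces to $d \log f^*/dx = -x$, as it should.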

4.1  Percentage  points 

When the constants are known, the density $f^*(x)$ may be integrated and
percentage points solved for numerically. Over the years, this was done, at
first very laboriously, for a small range of possibilities, but a quite
extensive tabulation was made, using electronic computers, in the late ’60s.
These tables are in Biometrika Tables for Statisticians, vol. II. The form of
the tables is as follows. The percentage points for $X$, the standardised
$x$-variable given by $X = (x - \mu)/\sigma$, are plotted in a two-way table
indexed by the skewness and kurtosis parameters $\beta_1$ and $\beta_2$. These
are defined by

$$\beta_1 = \frac{\mu_3^2}{\mu_2^3} \quad\text{and}\quad \beta_2 = \frac{\mu_4}{\mu_2^2}; \tag{18}$$

they have been defined to be scale-free, and $\sqrt{\beta_1}$ takes the sign of
$\mu_3$. $\beta_1$ measures skewness: a large (positive) $\sqrt{\beta_1}$ means
the curve is skewed towards positive values (long tail is to the right) and
vice versa for negative $\sqrt{\beta_1}$. A large $\beta_2$ (always positive)
means the density has heavy tails. Of course, all symmetric distributions have
$\beta_1 = 0$; a benchmark to measure kurtosis is the normal distribution, for
which $\beta_2 = 3$. Since $\kappa_4 = \mu_4 - 3\mu_2^2$, the parameter
$\gamma_2 = \beta_2 - 3 = \kappa_4/\kappa_2^2$ can also be regarded as
measuring kurtosis, with value $\gamma_2 = 0$ for the normal distribution.

Suppose, for a given $S$, we have $\sqrt{\beta_1} = 0.8$ and $\beta_2 = 4.6$.
To use Biometrika Tables, one enters the appropriate $\sqrt{\beta_1}$ table,
$\sqrt{\beta_1} = 0.8$, and travels down the left-hand column until the
$\beta_2$ value, 4.6, is reached. Along the row are 17 tabulated percentage
points for $X$, from $\alpha = 0.00$ to $\alpha = 1.00$. Interpolation must be
used for $\sqrt{\beta_1}, \beta_2$ values not explicitly given.

4.2  Un  peu  d’histoire 

At  this  point,  perhaps,  it  might  be  permitted  to  enliven  the  account  with  what 
the Guide Michelin calls un peu d’histoire. At the time Biometrika Tables Vol.
II were being prepared, I was fortunate enough to know Professor E. S. Pearson,
then retired but still very active, especially as Editor of Biometrika. He
had collaborated with workers in the U. S. to get the tables (Johnson, Nixon,
Amos and Pearson, 1963) and had carefully compiled the full set by hand. He
had  introduced  me  to  Pearson  curves,  which,  to  put  it  mildly,  did  not  figure 
prominently  in  statistical  training  of  the  day,  and  had  shown  me  how  effective 
they  could  be.  He  gave  me  a  copy  of  the  tables  to  use.  I  undertook  to  write 
a  Fortran  program  on  the  IBM  650,  to  interpolate  and  find  points,  given  the 
first  four  moments.  All  20  tables  were  then  typed  onto  punched  cards;  in  the 
end,  I  got  it  down  to  approximately  45  minutes  per  table.  This  is  not  such  a 
dramatic piece of history as Michelin usually provides (assignations and
assassinations often play a prominent role), but a diminishing generation of modern
readers  will  still  empathise  with  the  fears  of  losing  the  boxes  of  cards,  getting 
them  wet  in  the  snows  of  Montreal,  etc.,  not  to  mention  the  awful  discovery  of 
a  wrongly-typed  number! 

Since then, programs have been written to integrate the density equation
for $f^*(x)$ numerically and to solve for $x_\alpha$ for given $\alpha$, or to
provide the tail area for given $x$; one of these, kindly given to me by Amos
and Daniel (1971), has been added to my program; this greatly increases the
range of $\beta_1$ and $\beta_2$ for which Pearson curve approximations can be
found. However, points are still output from both the Amos and Daniel part of
the program and by the Biometrika Tables part, ostensibly as a check where
available, but truthfully as a sentimental tribute to E. S. P.

Later  on,  Charles  Davis  and  I  (Davis  and  Stephens,  1983)  added  to  the 
program  to  enable  a  fit  to  be  made  using  knowledge  of  an  end  point  (for  example, 
that  the  left-hand  endpoint  of  S  is  zero)  and  three  moments.  This  is  especially 
valuable  for  the  type  of  statistic  for  which  each  successive  moment  requires 
exponentially  increasing  hard  work  —  for  example,  the  distribution  of  areas,  or 
perimeters,  of  polygons  formed  by  randomly  dropping  lines  on  a  plane  —  see 
Solomon  and  Stephens  (1977).  The  Pearson-curve  fitting  program  is  available 
from  the  author. 

Further  developments  have  included  algorithms  to  facilitate  use  of  Pearson 
curves  —  see,  for  example,  Bowman  and  Shenton  (1979a,  1979b). 

4.3  Accuracy  of  Pearson  curve  fits 

(a) Pearson curve densities are unimodal, or possibly J- or U-shaped, but never
multimodal. They are also never negative.

(b) Percentage points or tail areas found from Pearson curve fitting have been
found, for unimodal long-tailed distributions, to be very accurate in the
long tail, at least for tail areas bigger than 0.005, or the 0.5% point.
Pearson and Tukey (1965) discuss this issue; Solomon and Stephens (1977)
give comparisons. (In making comparisons, one must of course compare
the Pearson curve fit with the correct $x_\alpha$, or the correct area for
given $x$, for a distribution which is not itself a member of the Pearson
family.)

(c) Davis (1975) has made extensive comparisons with Gram-Charlier fits using
only four moments. Pearson curve fits are better than Gram-Charlier fits
everywhere except for distributions very close to the normal, as measured
by the $\beta_1, \beta_2$ values.

4.4  Other  systems 

Johnson (1949) has proposed another family (divided into three parts) of curves
defined by four moments: for example, the $S_U$ curves are those given by the
relation

$$\zeta = \gamma + \delta \sinh^{-1} X \tag{19}$$

where $X = (x - \mu)/\sigma$, and $\gamma, \delta$ are to be chosen to make the
distribution of $\zeta$ as close as possible to $N(0, 1)$. A discussion, and
tables to facilitate the calculation of $\gamma$ and $\delta$, are in
Biometrika Tables for Statisticians Vol. II. Other authors have also proposed
families of distributions, but they have not come into such common use for the
purpose of approximating percentage points.
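
Once $\gamma$ and $\delta$ have been chosen, percentage points follow by inverting the $S_U$ relation. The sketch below is only an illustration of that inversion: it assumes the four constants are already known (it does not fit them from moments), and the names `loc` and `scale` stand in for the centring and scaling constants in $X$.

```python
from math import asinh, sinh
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def su_cdf(x, gamma, delta, loc, scale):
    """Cumulative area at x when gamma + delta*asinh((x-loc)/scale) is N(0,1)."""
    return _STD_NORMAL.cdf(gamma + delta * asinh((x - loc) / scale))


def su_point(alpha, gamma, delta, loc, scale):
    """Percentage point x_alpha obtained by inverting the S_U relation."""
    zeta = _STD_NORMAL.inv_cdf(alpha)  # normal percentile for area alpha
    return loc + scale * sinh((zeta - gamma) / delta)
```

The two functions are inverses of one another for `delta > 0`, so `su_cdf(su_point(alpha, ...), ...)` recovers `alpha`.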

5  Use  of  higher  moments 

We now turn to the first of two interesting questions: can higher moments
be used to improve the accuracy of Pearson curve fits in the long tail of the
distribution? The long tail will be supposed to lie to the right, as for the
distribution of $S$; then, since higher values of $x$ will contribute more to
the higher moments than smaller values, we might suppose that fits using higher
moments will improve accuracy. Unfortunately it is not easy to establish the
four constants in terms of higher moments (of course, only four of these would
be needed to fix the constants). A recursion formula exists to generate higher
moments, for $r = 2, 3, \ldots$:

$$r b_0 \mu'_{r-1} + \{(r+1) b_1 + a\} \mu'_r + \{(r+2) b_2 + 1\} \mu'_{r+1} = 0. \tag{20}$$

In this recursion, the constants $a$, $b_0$, $b_1$ and $b_2$ occur, and this
means that one cannot reverse the recursion and generate, say, $\mu$ and
$\sigma^2$ from $\mu_3, \mu_4, \mu_5$ and $\mu_6$.
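
The forward use of the recursion can be sketched as follows. The code assumes that $x$ is measured from the mean (so the moments are central, with $\mu_0 = 1$ and $\mu_1 = 0$), that the constants follow the convention $d \log f^*/dx = (a + x)/(b_0 + b_1 x + b_2 x^2)$, and that the requested moments actually exist for the curve in question; the function name is hypothetical.

```python
def pearson_central_moments(a, b0, b1, b2, highest):
    """Central moments mu_0 .. mu_highest of a Pearson curve.

    Solves r*b0*mu[r-1] + ((r+1)*b1 + a)*mu[r] + ((r+2)*b2 + 1)*mu[r+1] = 0
    for mu[r+1], starting from mu_0 = 1 and mu_1 = 0.
    """
    mu = [1.0, 0.0]
    for r in range(1, highest):
        denom = (r + 2) * b2 + 1
        if denom == 0:
            raise ZeroDivisionError(f"moment of order {r + 1} is not determined")
        mu.append(-(r * b0 * mu[r - 1] + ((r + 1) * b1 + a) * mu[r]) / denom)
    return mu[: highest + 1]
```

With the normal constants (`a = 0, b0 = -1, b1 = 0, b2 = 0`) the recursion reproduces the familiar central moments 1, 0, 1, 0, 3, 0, 15.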

Nevertheless, one can generate the fifth and sixth moments of the Pearson
curve with the same first four moments of, say, $S$, and compare them with the
true fifth and sixth moments of $S$. The first two moments are then slightly
changed, and the procedure successively repeated, until the third, fourth, fifth
and sixth moments of each curve match. This will mean that the mean and
variance of the Pearson curve will not be exactly the same as those for $S$,
although they will be close, and this will probably make a worse fit in the lower
tail; but for higher $x$ the fit could improve. I have made some comparisons using
this procedure, but, as one might expect, there appears to be no systematic
improvement. In discussion, when this paper was first presented, the suggestion
was made to use Least Squares to make “closest” fits, in order to compare the
six moments. More work is needed to compare Pearson curve fits along these
various lines, but it is not likely that the improvement will be sure, or will
extend to points far into the tails. In the end it must be remembered that one
curve is simply being laid on top of another, with only four parameters to vary,
and there is no mathematical analysis that will guarantee accuracy.

Other methods for developing accuracy in the extreme tails include numerical
inversion of the Characteristic Function (essentially the MGF with $it$
replacing $t$, where $i = \sqrt{-1}$), or saddlepoint approximations. A method
due to Imhof (1961) uses numerical inversion for distributions such as $S$, but
the computer time needed increases rapidly as the distance into the tails
increases (to give small tail areas). Field (1992) has recently examined
saddlepoint approximations for $S$. These would seem to give more promise of
tail-end accuracy in the long run.

6  Use  of  sample  moments 

The second interesting question is: how accurate are Pearson curve fits when
sample moments are used to make the fit? In the earliest days, this was the use
to which Pearson curves were applied: to find a smooth density to describe
a set of data, such as lengths of beans, or width of skulls. Kendall and Stuart
(1977, Vol. 1) give details of such a fit. In general, the Pearson curves will give
very good fits to a unimodal set of data, or even to J-shaped or U-shaped sets,
but it is important to assess the accuracy of extrapolation from the sample to
the supposed population from which it came. More precisely, we ask how close
the sample fit estimate of, say, the upper-tail 5% point is to the true population
5% point, and, further, whether or not the Pearson-curve point is better than
the estimated point derived from choosing the appropriate order statistic: in a
sample of 1000, the 951st value in ascending order, or in a sample of size 10000,
the 9501st value. Some investigation of these questions has been undertaken in
two quite different ways, by Johnstone (1988) and by myself (Stephens, 1991).
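
For reference, the order-statistic estimate mentioned above can be sketched in a few lines. The rank rule used here, $\lfloor n(1 - \alpha) \rfloor + 1$ for an upper-tail area $\alpha$, is an assumption chosen because it reproduces the 951st and 9501st values quoted in the text; the function name is hypothetical.

```python
from fractions import Fraction
from math import floor


def order_statistic_point(sample, upper_tail):
    """Order-statistic estimate of the point with upper-tail area upper_tail.

    upper_tail is given as a string or Fraction (for example "0.05") so the
    rank is computed exactly, with no floating-point rounding.
    """
    n = len(sample)
    rank = floor(n * (1 - Fraction(upper_tail))) + 1  # 1-based rank
    if not 1 <= rank <= n:
        raise ValueError("tail area too extreme for this sample size")
    return sorted(sample)[rank - 1]
```

For a sample consisting of the integers 1 to 1000 in any order, `order_statistic_point(sample, "0.05")` returns 951.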

The  accuracy  of  the  Pearson  curve  point  will  depend  on: 

1.  the  sample  size  n, 

2.  the  a-level  (tail  area)  of  the  point  required, 

3.  the  true  skewness  and  kurtosis  of  the  density  approximated, 

4.  higher  moments. 

Johnstone gives a small study, for samples from populations with the following
range of parameters:

$\beta_1$   0.0   0.0   1.0    1.0   2.0
$\beta_2$   3.3   4.0   5.25   6.0   7.5

Johnstone gives plots of the estimated coefficient of variation, CV, of the
Pearson curve $x_\alpha$ against $-\log \alpha$, where the base of logarithms
is 10. Thus the CV of the estimated $x_{0.01}$ is plotted against 2, that of
the estimated $x_{0.001}$ is plotted against 3, etc. The coefficient of
variation is estimated using a Taylor series approximation. As one might
expect, the CV goes up markedly as $\alpha$ gets smaller (so $-\log \alpha$
gets larger on the x-axis), and the steepness of the rise is greater for the
more skew distributions.

In Stephens (1991), Monte Carlo samples were taken from populations for
which exact percentage points could be found, and the exact points were
compared with those obtained from (a) Pearson curve fits using the moments of
each sample, and (b) the order statistic estimate from each sample. The order
statistic estimate will be asymptotically unbiased, while one can say nothing
exact about the point obtained by laying one curve on another; recall that
sample moments, especially the third and fourth, are extremely variable, even for
quite large samples. The results showed, as expected, that the Pearson curve
points were more biased. However, somewhat surprisingly, they had smaller
mean square error. Therefore, it might well be preferable to use the Pearson
curve points, although, again, more investigations should be made, especially if
the points required are far into the tail.


7  Goodness  of  fit  using  moments 

In this second part of the paper, we discuss how moments are used in
Goodness-of-Fit, that is, to test whether a random sample comes from a given
(continuous) distribution. The distribution will often have unknown parameters,
which must be estimated from the given sample.


7.1  Tests  based  on  skewness  and  kurtosis 

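Suppose the $r$-th sample moment $m_r$ about the mean is defined by

$$m_r = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^r. \tag{21}$$

The sample skewness and sample kurtosis are then defined by

$$b_1 = \frac{m_3^2}{m_2^3} \quad\text{and}\quad b_2 = \frac{m_4}{m_2^2}. \tag{22}$$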

These statistics are not unbiased estimates of $\beta_1$ and $\beta_2$, but
they are consistent, that is, the bias diminishes with increasing sample size.
The sample skewness and kurtosis are time-honoured statistics for testing
normality, having been used in a rather ad hoc manner for most of this century;
$b_1$ is compared with zero, and $b_2$ with 3, the value of $\beta_2$ for the
normal distribution. However, distribution theory of $b_1$ and $b_2$ is
difficult, and it is only since computers have been available that extensive
and reliable tables of significance points have existed for these statistics.
Further, $b_1$ and $b_2$ can be combined to give one overall statistic
(d’Agostino and Pearson, 1973, 1974; d’Agostino, 1986). For other distributions
Bowman and Shenton (1986) have also given tables for these statistics. Studies
have shown that skewness and kurtosis, especially combined, provide good
omnibus tests for normality, although less is known for other distributions. For
the important discrete distribution, the Poisson, all cumulants are equal to the
mean, denoted by the parameter $\lambda$; a time-honoured test for the Poisson
is based on the ratio of sample variance to sample mean, which of course should
be about one. Again, this simple statistic appears to compete well with others
in terms of power.
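
The statistics in this subsection are simple to compute. The sketch below takes $m_r$ with divisor $n$, $b_1 = m_3^2/m_2^3$ and $b_2 = m_4/m_2^2$; for the Poisson ratio it uses the usual $n - 1$ divisor for the sample variance, which is an assumption since the text does not specify one. All function names are hypothetical.

```python
from statistics import fmean, variance


def sample_central_moment(xs, r):
    """r-th sample moment about the mean, with divisor n."""
    mean = fmean(xs)
    return sum((x - mean) ** r for x in xs) / len(xs)


def skewness_kurtosis(xs):
    """Return (b1, b2) = (m3**2 / m2**3, m4 / m2**2)."""
    m2 = sample_central_moment(xs, 2)
    m3 = sample_central_moment(xs, 3)
    m4 = sample_central_moment(xs, 4)
    return m3**2 / m2**3, m4 / m2**2


def dispersion_ratio(counts):
    """Sample variance over sample mean; near 1 for Poisson counts."""
    return variance(counts) / fmean(counts)
```

For the sample 1, 2, 3, 4, 5 this gives $b_1 = 0$ (the sample is symmetric) and $b_2 = 1.7$, well below the normal value 3.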

7.2  A  formal  technique  based  on  moments 

Perhaps because of the variability of sample moments, which makes calculation
of significance points difficult for statistics based on these moments when
calculated from samples of reasonable size, it took some time to formalize a
technique based on moments. Gurland and Dahiya (1970) and Dahiya and Gurland
(1972) have however devised a general procedure. The essential steps are as
follows:

1. A vector $\zeta$ of length $s$, say, must be found, whose components
$\zeta_i$ are functions of the theoretical moments, and such that each
component is linear in the parameters. (This might involve re-parametrising
the distribution from its usual form.)

2. The estimate $h$ of $\zeta$ is obtained by replacing theoretical moments by
sample moments.

3. The test statistic is then based on the difference $h - \zeta$.

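Suppose that $\Sigma$ is the (asymptotic) covariance matrix of $\sqrt{n}\,h$,
$\theta$ is the $q$-vector of unknown parameters, and $W$ is the $s \times q$
matrix such that $\zeta = W\theta$. Then define

$$Q_t = n (h - W\hat\theta)' \hat\Sigma^{-1} (h - W\hat\theta),$$

where $\hat\theta = (W'\hat\Sigma^{-1}W)^{-1} W'\hat\Sigma^{-1} h$. The
statistic $\hat\theta$ is the regression estimate of $\theta$ obtained by
generalized least squares, and $\hat\Sigma$ is $\Sigma$ with the estimate
$\hat\theta$ used wherever $\theta$ appears.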

Gurland and Dahiya (1970, 1972) showed that, asymptotically, the test
statistic $Q_t$ has the $\chi^2$ distribution with $t = s - q$ degrees of
freedom. Currie and Stephens (1986, 1990) have studied the procedure, and show
several properties of $Q_t$. Among these are the fact that the test statistic
$Q_t$ can be broken into $t$ components, each with asymptotic distribution
$\chi^2_1$, and each testing different features of the distribution. Each
component is a function of moments or cumulants. For example, consider the test
for normality, that is, for the distribution $N(\mu, \sigma^2)$. Gurland and
Dahiya (1970) took $\zeta' = \{\mu, \log \mu_2, \mu_3, \log(\mu_4/3)\}$, so
that $h' = \{\bar{x}, \log m_2, m_3, \log(m_4/3)\}$. The matrix $W$ and the
parameter vector $\theta$ are

$$W = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \\ 0 & 2 \end{pmatrix}
\quad\text{and}\quad
\theta = \begin{pmatrix} \mu \\ \log \sigma^2 \end{pmatrix}.$$

The test statistic $Q_2$ becomes $c_1 + c_2$, where the two components are
$c_1 = n m_3^2/(6 m_2^3)$ and $c_2 = (3n/8)\{\log(m_4/(3 m_2^2))\}^2$. Thus the
method leads to $n b_1/6$ and $(3n/8)\{\log(b_2/3)\}^2$ as test statistics,
equivalent to the “old-fashioned” $b_1$ and $b_2$.
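
The normality case can be sketched directly from the two components. The code below takes $c_1 = n b_1/6$ and $c_2 = (3n/8)\{\log(b_2/3)\}^2$ and refers their sum to the $\chi^2$ distribution with 2 degrees of freedom, whose upper tail area is exactly $e^{-q/2}$; the p-value is therefore asymptotic only, and the function name is hypothetical.

```python
from math import exp, log


def moment_normality_test(xs):
    """Return (c1, c2, Q, p) for the two-component moment test of normality."""
    n = len(xs)
    mean = sum(xs) / n
    m2, m3, m4 = (sum((x - mean) ** r for x in xs) / n for r in (2, 3, 4))
    b1 = m3**2 / m2**3
    b2 = m4 / m2**2
    c1 = n * b1 / 6                        # skewness component
    c2 = (3 * n / 8) * log(b2 / 3) ** 2    # kurtosis component
    q = c1 + c2
    p = exp(-q / 2)  # upper tail of chi-square with 2 d.f. (asymptotic)
    return c1, c2, q, p
```

Looking at `c1` and `c2` separately indicates whether a large `Q` is driven by skewness or by kurtosis.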

However, it should be noted that the components are not unique; they depend
on how $\zeta$ is formed. Currie and Stephens (1986, 1990) discuss these
questions in some detail.


8  Components  of  other  goodness-of-fit  statistics

Other  goodness-of-fit  statistics  also  have  components  which  are  functions  of 
moments.  The  oldest  of  these  was  proposed  by  Neyman  (1937),  in  connection 
with  a  test  for  uniformity. 

A test for a fully specified continuous distribution (that is, all parameters
known) can always be converted to a test for uniformity by means of the
Probability Integral Transformation, and a test for the exponential distribution
can also be so converted, even when the scale and origin parameters are not
known, so that Neyman’s test has wider applicability than it might at first
appear. (For details of these transformations, see Stephens, 1986a, 1986b.)

Neyman’s test is as follows: suppose the test is that $Z$ has a uniform
distribution between 0 and 1, written $U(0,1)$. On the alternative, let the
logarithm of the density of $Z$ be expanded as a series of Legendre polynomials:

$$\log f(z) = A(c)\{1 + c_1 L_1(z) + c_2 L_2(z) + c_3 L_3(z) + \cdots\}, \tag{23}$$

where the $c_i$ are coefficients, components of the vector $c$, $L_i(z)$ is the
$i$-th Legendre polynomial, and $A(c)$ is a normalising constant.

A test for uniformity is then a test that all $c_i = 0$. The estimates of
$c_i$ are

$$\hat{c}_i = \sum_{j=1}^{n} L_i(z_j) \tag{24}$$

where $z_1, z_2, \ldots, z_n$ is the given sample.

The first few Legendre polynomials are best expressed in terms of $y = z - 0.5$.
Then

$$L_1(z) = 2\sqrt{3}\,y, \tag{25}$$

$$L_2(z) = \sqrt{5}(6y^2 - 0.5), \tag{26}$$

$$L_3(z) = \sqrt{7}(20y^3 - 3y), \tag{27}$$

so that the estimate $\hat{c}_1$ becomes a function of the first moment about
the known mean 0.5, the second estimate $\hat{c}_2$ becomes a function of the
second moment, $\hat{c}_3$ a function of both the third and the first moments,
etc.

Neyman shows that the suitably normalised $\hat{c}_i$ have asymptotic $N(0, 1)$
distributions, and his overall test statistic is the sum of the squares of these
normalised estimates. Thus the overall statistic has an asymptotic $\chi^2$
distribution, just as for the Dahiya-Gurland statistic, and the individual
terms, based on moments, are the components of the overall test statistic.
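
A minimal sketch of the first three components follows. It assumes the Legendre polynomials are orthonormal on (0, 1), namely $L_1 = 2\sqrt{3}\,y$, $L_2 = \sqrt{5}(6y^2 - 0.5)$ and $L_3 = \sqrt{7}(20y^3 - 3y)$ with $y = z - 0.5$, and that "suitably normalised" means dividing each sum by $\sqrt{n}$; both the truncation at three terms and the function names are choices made here for illustration.

```python
from math import sqrt


def legendre_components(z):
    """L1, L2, L3 on (0, 1), written in terms of y = z - 0.5."""
    y = z - 0.5
    return (2 * sqrt(3) * y,
            sqrt(5) * (6 * y**2 - 0.5),
            sqrt(7) * (20 * y**3 - 3 * y))


def neyman_statistic(zs):
    """Return the three normalised components and their sum of squares."""
    n = len(zs)
    sums = [0.0, 0.0, 0.0]
    for z in zs:
        for i, value in enumerate(legendre_components(z)):
            sums[i] += value
    normalised = [s / sqrt(n) for s in sums]
    return normalised, sum(u * u for u in normalised)
```

Under uniformity the returned sum of squares would be referred to the $\chi^2$ distribution with 3 degrees of freedom, one for each component retained.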


9  EDF  statistics 

Another important family of goodness-of-fit statistics is that derived from the
Empirical Distribution Function (EDF) of the $z$-sample. This family includes
the well-known Kolmogorov-Smirnov statistic and the Cramér-von Mises family
of statistics (for details and tests for many distributions based on these, see
Stephens, 1986a).

One of the most important of the Cramér-von Mises class is $A^2$, introduced
by Anderson and Darling (1954). The definition of $A^2$ is based on an integral
involving the difference between the EDF and the tested distribution $F(x)$ (with
parameters estimated if necessary). The working formula is

$$A^2 = -n - \frac{1}{n} \sum_{i=1}^{n} (2i - 1)\left[\log z_{(i)} + \log(1 - z_{(n+1-i)})\right], \tag{28}$$

where $z_{(i)} = F(x_{(i)})$, and $x_{(i)}$ are the order statistics.
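
The working formula translates directly into code. The sketch below assumes the values $z_i = F(x_i)$ have already been computed and lie strictly between 0 and 1; the function name is hypothetical.

```python
from math import log


def anderson_darling(zs):
    """A^2 computed from values z_i = F(x_i), each strictly inside (0, 1)."""
    z = sorted(zs)  # z[0] <= z[1] <= ... are the ordered values
    n = len(z)
    total = sum((2 * i - 1) * (log(z[i - 1]) + log(1 - z[n - i]))
                for i in range(1, n + 1))
    return -n - total / n
```

For a single observation at $z = 0.5$ the formula reduces to $2 \log 2 - 1$.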

As  an  omnibus  test  statistic,  A2  has  been  shown  to  perform  well  in  many 
test  situations. 

Anderson and Darling showed that the asymptotic distribution of $A^2$ is,
like $S$ of Section 1, a sum of weighted $\chi^2_1$ variables. The individual
terms in the sum can again be regarded as components of the entire statistic, and
Stephens (1974) has investigated these components in some detail. A remarkable
result is that they too are based on Legendre polynomials, so that they are
effectively the same as the Neyman components, based on moments of the
$z$-sample. There has been some investigation of components of these and other
statistics, as individual test statistics for the distribution under test;
references are given by Stephens (1986a). As for the Gurland-Dahiya components,
they can be expected to be sensitive to different departures from the tested
distribution. The complete test statistics of Neyman and of Anderson-Darling
combine the same components, but with different weightings.

Abstract

Higher-order statistics (HOS) are now very widely used. Two areas where they
have begun to receive considerable attention are array and speech processing.
This paper describes some recent applications of HOS in both areas by the
authors [19]-[20].

In our speech processing application, we demonstrate a way to better discriminate
between voiced and unvoiced speech. This is accomplished by observing the behavior
of a cumulant-based adaptive filter, and makes use of the fact that unvoiced speech is
Gaussian, whereas voiced speech is definitely non-Gaussian. We have also shown a way
to utilize the prediction residual from the adaptive filter to estimate the pitch period
for voiced speech.

Array processing encompasses a multitude of problems, including beamforming
and direction-of-arrival (DOA) estimation. We have developed fourth-order
cumulant-based blind optimum beamforming algorithms that outperform existing
methods. The term blind indicates that our methods do not require a priori
knowledge of array geometry and DOA, nor are they affected by multipath
propagation and the presence of smart jammers. Extensive simulations support
our theoretical claims on the optimality of our beamforming procedure.


1  Introduction 

Our work on speech processing describes a method that consists of an adaptive predictor, a voicing
decision (V/UV), and a pitch period estimator. The focus of this study is on robust detection of
speech state and estimation of pitch period. This is accomplished by observing the behavior of an
adaptive predictor which processes the speech signal. Higher-order statistical analysis is proposed
for discrimination of speech states. Comparing the energy of the original speech signal with that
of the prediction-error residual yields the decision method. Both covariance and cumulant-based
prediction methods are investigated and the latter is shown to be a more robust way of making the
(V/UV) decision. Pitch estimation is accomplished by using correlation-based approaches that
operate on the energy estimate of the cumulant-based prediction residual rather than the original
speech signal. Pitch estimation by our method yields better performance than currently existing
batch procedures.

Array processing work, as described in this paper, addresses the problem of blind optimum
beamforming for a non-Gaussian desired signal in the presence of interference. Sensor response,
location uncertainty and use of sample statistics can severely degrade the performance of optimum
beamformers. In this paper, we propose blind estimation of the source steering vector in the
presence of multiple, directional, correlated or coherent Gaussian interferers via higher-order
statistics. In this way, we employ the statistical characteristics of the desired signal to make the
necessary discrimination, without any a priori knowledge of array manifold and direction-of-arrival
information about the desired signal. We then improve our method to utilize the data in a more
efficient manner. In any application, only sample statistics are available, so we propose a robust
beamforming approach that employs the steering vector estimate obtained by cumulant-based signal
processing. We further propose a method that employs both covariance and cumulant information to
combat finite sample effects. We analyze the effects of multipath propagation on the reception of
the desired signal. We show that, even in the presence of coherence, the cumulant-based beamformer
still behaves as the optimum beamformer that maximizes the Signal to Interference plus Noise Ratio
(SINR). Finally, we propose an adaptive version of our algorithm. Simulations demonstrate the
excellent performance of our approach in a wide variety of situations.

2  Cumulant-Based  Adaptive  Analysis  of  Speech  Signals

The Voiced/Unvoiced (V/UV) decision is an important problem in speech processing. Almost all speech
coding, recognition and speaker identification systems require this information for effective
processing of speech data. In addition, low-delay speech processing systems require that this decision
be provided in real time. In [2] some commonly employed features are described, and a subset of
them is used to train an artificial neural network to perform the V/UV decision.

In frame-based analysis of speech signals, feature extraction is performed on the current block
of data, and a decision is given at the end of the period. For this reason, frame-based methods
are incapable of tracking rapid changes in signal characteristics. Transitions of the state of speech
within a frame period affect the decisions resulting from a frame-based analyzer. In general, this
mixed state of speech within a period cannot be identified and incorrect decisions will be made.
This will degrade the performance of the overall speech processing system. In addition, frame-based
analysis introduces delay, which may not be tolerable in low-delay systems.

Severe non-stationarity observed in speech signals and the low-delay requirements of
contemporary speech processing systems motivate the use of adaptive algorithms for feature
extraction in place of their batch counterparts. In general, adaptive processing techniques are
designed to minimize some least-squares error criterion. Their use is motivated by the assumption
that the processes are Gaussian and the performance analysis is tractable with this assumption [3];
however, this approach ignores the non-Gaussian nature of the underlying signal.

Figure 1: Typical speech signals: (a) Unvoiced speech. (b) Voiced speech.

Adaptive prediction of the incoming signal and continuous monitoring of prediction error power
makes detecting changes in the spectral characteristics of the process possible. We may consider
such a change as an event. After an event, an adaptive unit will require a period to adjust itself
for the new configuration. During this learning period, prediction error power will temporarily
increase. This observation has been used to detect abrupt changes in the autoregressive (AR)
parameters of a linear process. If a lattice form is used rather than a finite impulse response (FIR)
filter, reflection coefficients will be available for monitoring purposes. In addition, adaptive lattice
filters exhibit better learning characteristics than their FIR counterparts. This may improve the
ability to localize the event when prediction error power is monitored.
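
The idea of watching prediction error power for an event can be illustrated with a toy example. The sketch below is not the cumulant-based or lattice filter discussed in this paper: it is a plain second-order normalised LMS (NLMS) transversal predictor, applied to a synthetic sinusoid whose frequency jumps half-way through. The step size, filter order, test signal and all names are assumptions made for illustration.

```python
from math import cos


def nlms_prediction_errors(signal, order=2, step=0.5, eps=1e-12):
    """One-step-ahead NLMS linear predictor; returns the error sequence."""
    weights = [0.0] * order
    errors = []
    for t in range(order, len(signal)):
        past = signal[t - order:t][::-1]  # x[t-1], x[t-2], ...
        prediction = sum(w * x for w, x in zip(weights, past))
        error = signal[t] - prediction
        energy = sum(x * x for x in past) + eps
        weights = [w + step * error * x / energy
                   for w, x in zip(weights, past)]
        errors.append(error)
    return errors


def mean_power(values):
    return sum(v * v for v in values) / len(values)


# Synthetic "event": the frequency jumps from 0.5 to 1.5 rad/sample.
change = 2000
phase, signal = 0.0, []
for t in range(4000):
    phase += 0.5 if t < change else 1.5
    signal.append(cos(phase))

errors = nlms_prediction_errors(signal)
k = change - 2  # errors[k] is the error in predicting signal[change]
power_before = mean_power(errors[k - 20:k])
power_after = mean_power(errors[k:k + 20])
```

Before the jump the predictor has converged and `power_before` is negligible; in the window just after the jump `power_after` rises sharply while the weights re-adapt, which is the behaviour an event detector would threshold on.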

In this study, we shall investigate the application of adaptive prediction methods to detect
V/UV transitions in speech signals; hence, events of interest will be V/UV or UV/V transitions.
Our approach will take the speech production model into account and utilize higher than
second-order statistics of speech signals.

2.1  Speech  Production  Model 

The state of a speech signal belongs to one of three categories: voiced, unvoiced and silence. Silent
periods can be detected easily by monitoring the zero crossing rate and energy of the received
signals. For this reason, we shall concentrate on voiced/unvoiced classification of speech.

Unvoiced sounds are generated by forming a constriction at some point in the vocal tract and forcing air through the constriction at a high velocity to produce turbulence. This creates a broad-spectrum noise source to excite the vocal tract. The energy concentration is shifted to the high-frequency end of the spectrum for unvoiced sounds, but the spectrum is relatively flat when compared with that of voiced speech. Due to the large number of random effects involved in the production of unvoiced speech, Gaussian noise is a valid candidate as the excitation source. This assumption is validated by Wells. In his work, the bispectrum is used to make the V/UV decision.

It has been found that the bispectrum of English fricatives tends to zero, but for vowels the situation is just the opposite. A typical unvoiced segment of speech is shown in Fig. 1a.

Figure 2: Adjacent sample correlation of speech signals.

Voiced sounds are produced by forcing air through the glottis with the tension of the vocal cords adjusted so that they vibrate in a relaxation oscillation, thereby producing quasi-periodic pulses of air which excite the vocal tract. This excitation is clearly non-Gaussian. The energy concentration is in the low-frequency side of the spectrum in the form of a fundamental component and its harmonics. In addition, voiced sounds have more energy than unvoiced sounds. A typical voiced speech segment is shown in Fig. 1b.

For voiced sounds, the vocal tract can be modelled as an all-pole linear system. The same model also holds for unvoiced sounds but the AR order is less. Correlation between adjacent samples is high for voiced sounds. On the other hand, unvoiced speech resembles white noise since its spectrum is relatively flat, yielding small correlation between adjacent samples. Correlation sequences for voiced and unvoiced cases are illustrated in Fig. 2.

The differences in the excitation and correlation properties for these two cases can be used to discriminate between them; however, with second-order statistics we can only use the correlation properties but cannot utilize the information about the excitation model. This motivates the use of higher-order cumulants of speech signals.


2.2  Our  Approach 

In the previous section, we mentioned the distinctions between voiced and unvoiced sounds: correlation among adjacent samples and excitation models. In this section, we shall investigate methods that fully utilize this information.

Linear prediction (LP) methods are employed to accomplish our goal; however, we shall not use batch-type methods for reasons outlined previously. Linear prediction can be based on second- or higher-order statistics; however, the former is usually employed. Linear prediction is essentially identifying the inverse of a linear system driven by white noise; hence, it can be considered as a system identification problem. The system under consideration can be approximated by an AR model, so an FIR prediction filter will whiten the spectrum of the incoming signal. We shall investigate the differences between cumulant- and covariance-based adaptive prediction methods in this section.

2.2.1  Second-order  statistics  based  adaptive  filtering 

Correlation-based  adaptive  prediction  filters  tend  to  minimize  the  prediction  error  power  at  the 
output  of  the  filter.  Since  correlation  among  adjacent  samples  is  high  for  voiced  signals,  we  can 
remove  a  large  proportion  of  energy  from  the  original  speech  signal  using  prediction.  On  the  other 
hand,  in  the  case  of  unvoiced  sounds,  LP  will  not  be  that  successful  due  to  small  correlation  among 
samples.  Therefore,  a  comparison  of  the  input  signal  power  with  the  power  in  the  prediction  residual 
will  reveal  the  state  of  the  speech  signal. 
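
The power comparison described above can be sketched with a small adaptive predictor. The following is only an illustration under stated assumptions: it uses a normalized LMS (NLMS) transversal predictor rather than the lattice filter discussed in the text, synthetic signals stand in for speech (an AR(1) process for the highly correlated "voiced" case, white noise for the "unvoiced" case), and the function names, filter order and step size are hypothetical choices.

```python
import random

def nlms_prediction_error(x, order=4, mu=0.5, eps=1e-6):
    """Prediction-error sequence of a normalized-LMS linear predictor."""
    w = [0.0] * order
    err = []
    for n in range(order, len(x)):
        past = x[n - order:n][::-1]              # most recent sample first
        pred = sum(wi * pi for wi, pi in zip(w, past))
        e = x[n] - pred                          # prediction error at time n
        norm = eps + sum(p * p for p in past)    # input power normalization
        w = [wi + mu * e * pi / norm for wi, pi in zip(w, past)]
        err.append(e)
    return err

def power(s):
    return sum(v * v for v in s) / len(s)

def residual_to_input_ratio(x, order=4):
    """Ratio of prediction-residual power to input power."""
    e = nlms_prediction_error(x, order)
    return power(e) / power(x[order:])

random.seed(0)
n_samples = 4000
white = [random.gauss(0.0, 1.0) for _ in range(n_samples)]

# Assumed stand-in for voiced speech: AR(1) process with pole at 0.95,
# which has high adjacent-sample correlation.
drive = [random.gauss(0.0, 1.0) for _ in range(n_samples)]
correlated = [0.0]
for v in drive[1:]:
    correlated.append(0.95 * correlated[-1] + v)

ratio_correlated = residual_to_input_ratio(correlated)
ratio_white = residual_to_input_ratio(white)
print("residual/input power, correlated input:", ratio_correlated)
print("residual/input power, white input:     ", ratio_white)
```

For the correlated input a large share of the power is removed, while for the white input the ratio stays near or above one, which is the contrast the V/UV decision relies on.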

Lattice prediction filters enable monitoring the variation of prediction error power with model order due to their specific structure. Autoregressive model-order selection can be performed by selecting the tap which results in minimum prediction-error power. This leads to another discrimination between voiced and unvoiced sounds, since this order will be relatively lower for the unvoiced case.

2.2.2  Fourth-order  statistics  based  adaptive  filtering 

In this section, we shall investigate the behavior of a fourth-order cumulant-based adaptive filter. An adaptive algorithm for estimating the parameters of nonstationary AR processes excited by non-Gaussian signals is proposed in [<>•'>], and some modifications are suggested in [22]. We used the method of [fid], which is in the software package Hi-Spec (trademark of United Signals and Systems, Inc.) [d.'i]. The ideas for the covariance-based filter directly apply to this case with one important exception: the cumulant-based adaptive filter provides the solution to the cumulant-based normal equations, and this solution is not the one that minimizes the prediction-error power; however, one may argue that if the speech production system can be identified accurately, then the prediction error should be close to the minimum possible value.

With higher-order statistics, we have the diversity of using the excitation information: for voiced sounds, the excitation is non-Gaussian; hence, the speech production mechanism can be identified by cumulant-based AR equations. On the other hand, for unvoiced sounds the excitation is Gaussian, making the identification problem ill-posed (see footnote 1). The cumulant-based adaptive filter will not be able to identify the system and, since there is no associated output-power minimization criterion, prediction-error power may arbitrarily increase. In this case, a cumulant-based filter may even amplify the speech signal, making the power reduction by prediction comparison more clear than when using a covariance-based method.
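
The difference between the two excitation types can be seen already in the simplest fourth-order statistic. The sketch below is an illustration only, not the adaptive algorithm used in the experiments: it estimates the zero-lag fourth-order cumulant of a real signal, and compares a Gaussian sequence with a spiky Bernoulli-Gaussian sequence that is assumed here as a rough stand-in for voiced excitation. The function name, sample size and spike probability are hypothetical.

```python
import random

def fourth_order_cumulant(x):
    """Zero-lag fourth-order cumulant estimate: m4 - 3 * m2**2 (real data)."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    m2 = sum(v ** 2 for v in centered) / n
    m4 = sum(v ** 4 for v in centered) / n
    return m4 - 3.0 * m2 ** 2

random.seed(1)
n = 20000

# Gaussian excitation (assumed model for unvoiced speech), unit variance.
gaussian = [random.gauss(0.0, 1.0) for _ in range(n)]

# Spiky excitation (assumed stand-in for voiced speech): a sample is nonzero
# with probability p, scaled so that the overall variance is also one.
p = 0.1
spiky = [random.gauss(0.0, (1.0 / p) ** 0.5) if random.random() < p else 0.0
         for _ in range(n)]

c4_gaussian = fourth_order_cumulant(gaussian)
c4_spiky = fourth_order_cumulant(spiky)
print("Gaussian excitation:", c4_gaussian)
print("Spiky excitation:   ", c4_spiky)
```

The Gaussian estimate stays close to zero, so equations built from such cumulants carry almost no information, whereas the spiky sequence yields a large positive value.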

To validate our ideas about covariance- and cumulant-based adaptive prediction of speech signals, we performed some experiments using data from the TIMIT speech recognition database. The results verify our claims and are provided in the next section.

1 A cumulant-based filter provides the solution of cumulant-based normal equations in an adaptive fashion; however, this set of equations becomes trivial when the input to be analyzed is a Gaussian linear process, because higher than second-order cumulants of Gaussian processes are zero.

2.3 Experiments

We start our experiments by investigating the prediction performance of correlation- and cumulant-based linear predictors in the voiced speech case. An indication of performance is the energy of the prediction-error residual at the output of the filter. For this purpose, we selected a voiced speech segment from the TIMIT database and performed adaptive filtering based on both correlation and cumulants. We expected that the correlation-based filter would yield better performance, since it is designed to minimize prediction-error power. The original speech signal is scaled so that the estimate of its variance is unity. The results of this experiment are shown in Fig. 3. Energy values reported in this figure represent the estimate of the variance of the signal averaged over the data window. Interestingly enough, the cumulant-based filter performed better than its covariance counterpart, although the latter is designed to minimize the power of the prediction residual. We repeated this experiment with other speech segments and in all of the cases, the cumulant-based filter outperformed the covariance-based filter.

In voiced speech, a conventional system identification approach for estimating the AR parameters, using a least-squares fit procedure, suffers due to the nature of the excitation sequence. It is known that, for voiced speech, the source is definitely non-Gaussian; it is quasi-periodic in nature with spiky excitations. The impulsive nature of the excitation in voiced speech is exploited in [40], by making a Bernoulli-Gaussian assumption to develop a multipulse coding scheme. In [30], a robust linear prediction algorithm is proposed which takes into account the non-Gaussian nature of source excitation for voiced speech by assuming the excitation is from a mixture distribution, such that a large portion of the excitation sequence is from a normal distribution with small variance while a small portion comes from an unknown distribution of higher variance. Such a distribution is called heavy-tailed Gaussian. Based on the above mixture model, a linear prediction algorithm is devised which employs robust statistical procedures (developed in [34]) that operate in a batch mode. Although satisfactory performance is observed, the method cannot track the transitions in the input data. This points out a very important fact: conventional linear prediction can be unsatisfactory due to incorrect modelling of the excitation. Of course, this carries over to the adaptive domain, i.e., a correlation-based adaptive algorithm may not be able to yield the best possible fit in the presence of outliers in the data. On the other hand, a non-Gaussian excitation is required by higher-order-statistics-based identification algorithms. A cumulant-based adaptive filter is able to reduce the power in the signal by effective prediction, although it is not based on a criterion for minimizing the power of the prediction residual. Power reduction may be even more than that provided by a covariance-based filter due to the just-described outlier problem.

To analyze the behavior of adaptive predictors in voiced and unvoiced speech states, we selected a 250 msec speech segment in which there are two transitions: voiced (0-75 msec), unvoiced (75-190 msec) and again voiced (190-250 msec). This signal is shown in Fig. 4.

We used an order-ten predictor for adaptive filtering of the speech waveform. Figure 5 shows the prediction error from a covariance-based filter. Observe that an adaptive filter based on a power minimization criterion will turn off during the unvoiced period; hence, this segment passes undistorted through the filter. The reason for this (as explained previously) is the small adjacent-sample correlation for unvoiced sounds, which makes the process unpredictable. To minimize the output power, the filter turns off; however, during voiced segments deconvolution is successful. We observe a quasi-periodic pulse train for the prediction residual, which is in accordance with the excitation model for voiced speech production.



Figure  5:  Prediction  residual  from  covariance-based  adaptive  filter:  (a)  first  125  msecs,  (b) 
last  125  msecs. 


Figure 6 depicts the cumulant-based filter residual. During voiced periods, successful deconvolution is possible since the excitation is non-Gaussian, and again a quasi-periodic pulse train is observed at the output of the filter. Now, however, the filter amplifies the speech signal during the unvoiced segment. As explained before, during this mode of operation, the system identification task is ill-posed, and, since this filter has no power minimization criterion, the power of the prediction residual becomes higher than that of the unvoiced speech signal.

To make better comparisons concerning the energy of the original speech and prediction residuals obtained via the two different filters, we illustrate the energy estimates in Fig. 7. Energy is estimated by first squaring the signal and then performing low-pass filtering using a 15-point Hamming window. Fig. 7 shows that, by comparing the prediction-residual power and the original-signal power, it is possible to make reliable V/UV decisions. With the cumulant-based method, even better results are obtained, because it amplifies the input data during unvoiced periods.
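
The energy estimate and the power comparison can be sketched as follows. This is a minimal illustration, assuming a window normalized to unit DC gain and a simple fixed threshold for the decision; neither the normalization nor the threshold value is specified in the text, and the function names are hypothetical.

```python
import math

def hamming(length):
    """Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]

def energy_estimate(signal, window_length=15):
    """Square the signal, then low-pass filter with a Hamming window."""
    window = hamming(window_length)
    gain = sum(window)
    window = [w / gain for w in window]      # assumption: unit DC gain
    squared = [s * s for s in signal]
    half = window_length // 2
    out = []
    for n in range(len(squared)):
        acc = 0.0
        for k, w in enumerate(window):
            idx = n + k - half               # window centered on sample n
            if 0 <= idx < len(squared):
                acc += w * squared[idx]
        out.append(acc)
    return out

def is_voiced(input_energy, residual_energy, threshold=0.5):
    """Hypothetical decision rule: voiced if prediction removes enough power."""
    return residual_energy < threshold * input_energy

energy = energy_estimate([2.0] * 40)
print(energy[20])
```

Away from the edges, a constant signal of amplitude 2 gives an energy estimate of 4, which is a quick check on the normalization.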

The observations from this experiment validate our earlier statements; however, using a predictor may bring additional advantages as well. One important by-product is pitch period estimation. Pitch period is the time difference between the quasi-periodic excitation pulses during voiced speech. After the V/UV detection step, better pitch estimation is possible by operating on the energy estimate of the prediction residual rather than on the original speech signal. From Fig. 7, we observe that the peaks in the energy estimate sequence are spaced by a pitch period during voiced periods and they are sharper than the ones in the original speech signal due to the combined filtering and squaring operations. Consequently, we may apply the correlation-based approach described in [18] to the energy estimate sequence, for a reliable, simple but robust calculation of pitch period. In [18], pitch estimation is accomplished as follows: the low-pass filtered speech signal is quantized to three levels, -1, 0, 1, and the correlation sequence of this quantized signal is obtained. Covariance calculation is simple with the quantized sequence, since it can be performed only by addition. Finally, a peak-picking method estimates the pitch period. Peak search is performed on the possible range of values that the pitch period can take, which is called the admissible pitch range. We applied this method to the energy estimate of the prediction residual from the cumulant-based predictor that processes the speech segment in Fig. 4. The original signal and pitch estimates are given in Fig. 8. The decisions and estimates agree with the signal characteristics. Results from the correlation-based filter are also accurate for this speech segment; however, the accuracy of the correlation-based method depends more on the threshold employed in comparing the power of the prediction residual to that of the input than in the cumulant-based counterpart, since the latter amplifies unvoiced speech. Therefore, we can observe degradation in the correlation-based case since it is sensitive to the value of the threshold.

Figure 6: Prediction residual from cumulant-based adaptive filter: (a) first 125 msecs, (b) last 125 msecs.
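
The three-level quantization and peak-picking steps can be sketched as below. This is an illustration of the general idea only, not the exact procedure of [18]: the clipping threshold (a fraction of the peak amplitude), the function names and the synthetic pulse train are assumptions made for the example.

```python
def three_level(signal, clip):
    """Quantize each sample to -1, 0 or 1 using a clipping threshold."""
    return [1 if s > clip else -1 if s < -clip else 0 for s in signal]

def correlation(q, lag):
    """Correlation of the quantized sequence at one lag.

    Every product is -1, 0 or 1, so this reduces to additions."""
    return sum(q[n] * q[n + lag] for n in range(len(q) - lag))

def estimate_pitch(signal, min_lag, max_lag, clip_fraction=0.3):
    """Pick the lag with the largest correlation in the admissible range."""
    clip = clip_fraction * max(abs(s) for s in signal)   # assumed threshold
    q = three_level(signal, clip)
    return max(range(min_lag, max_lag + 1), key=lambda lag: correlation(q, lag))

# Synthetic energy-like sequence: sharp pulses every 50 samples on a low floor.
pulses = [1.0 if n % 50 == 0 else 0.05 for n in range(400)]
pitch = estimate_pitch(pulses, min_lag=20, max_lag=120)
print(pitch)
```

On this pulse train the estimate equals the pulse spacing of 50 samples; the admissible range of 20 to 120 samples is likewise a hypothetical choice.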

The second voiced speech segment in Fig. 4 is an example of the situation when harmonics are stronger than the fundamental frequency component. In general, correlation-based approaches operating directly on the speech signal fail when this event is present. To demonstrate this, we implemented the method described in [37]. In [37], pitch estimation is accomplished by calculating the correlation sequence of the low-pass filtered speech signal, and employing a peak-picking algorithm on the correlation sequence. Peak searching is done on the admissible pitch range. For reliability purposes, the algorithm also investigates the possibility of pitch errors, by checking for peaks at one-half, one-third, one-fourth, one-fifth, and one-sixth of the first estimate of the pitch period, if they are in the admissible pitch range. If a peak at these locations is larger in amplitude than half of that of the current estimate, the pitch estimate is changed to the location of this peak. In our experiment, the pitch detector of [37] locates the major peak at lag 68; however, its decision rule identifies another peak around lag 34 which is in the admissible pitch range. Since the amplitude of the peak at lag 34 is larger than half of that of the major peak, the final pitch estimate is chosen to be half of the correct value, which is a gross error.

Figure 7: Energy estimates. (a) Original speech signal; (b) energy estimate of original speech signal; (c) energy estimate of prediction-error residual from covariance-based filter; (d) energy estimate of prediction-error residual from cumulant-based filter.
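
The submultiple check that produces the gross error described above can be sketched as follows. This is a simplified reading of the decision rule, not the implementation of [37]: it tests the correlation value exactly at the rounded submultiple lag instead of searching for a nearby peak, and the function name, lag range and correlation values are hypothetical.

```python
def refine_pitch(corr, first_lag, min_lag, max_lag):
    """Check submultiples of the first pitch estimate.

    corr[lag] holds the correlation value at each lag. A submultiple replaces
    the current estimate when it lies in the admissible range and its
    correlation exceeds half of that at the current estimate."""
    estimate = first_lag
    for divisor in (2, 3, 4, 5, 6):
        candidate = round(first_lag / divisor)
        if not (min_lag <= candidate <= max_lag):
            continue
        if corr[candidate] > 0.5 * corr[estimate]:
            estimate = candidate
    return estimate

# Hypothetical correlation sequence: major peak at lag 68, and a strong
# secondary peak at lag 34 caused by a dominant harmonic.
corr = [0.0] * 100
corr[68] = 1.0
corr[34] = 0.6
halved = refine_pitch(corr, first_lag=68, min_lag=20, max_lag=90)

# With a weaker secondary peak the first estimate is kept.
corr[34] = 0.3
kept = refine_pitch(corr, first_lag=68, min_lag=20, max_lag=90)
print(halved, kept)
```

The first call reproduces the failure mode: the peak at lag 34 exceeds half the major peak, so the estimate is halved even though lag 68 is the correct pitch period.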


2.4  Conclusions 

In this work, we showed that it is possible to track transitions in the state of speech using adaptive linear prediction. Both covariance- and cumulant-based methods are investigated, and greater contrast between V/UV cases is demonstrated by the latter method because cumulants can use the difference in the excitation model of the two speech states.

Pitch-period estimation is also possible by linear prediction. Rather than operating on the original signal, we prefer to employ the prediction-error residual available from an adaptive filter. A cumulant-based approach operating on the power estimate of the residual process is shown to be a practical way of pitch estimation.

We investigated the prediction performance of adaptive predictors based on correlation and cumulants and found that cumulant-based prediction can outperform correlation-based prediction, although the latter is designed to minimize the power of the prediction residual. We conjectured that outliers in the excitation model of voiced speech result in this phenomenon. Better prediction performance obtained via cumulants is worth investigating analytically; however, this is not tractable with real or synthesized speech since there are many parameters involved. Simpler cases, such as a single sinusoid in Gaussian noise, can be analyzed to evaluate the performance of cumulant- and covariance-based adaptive line enhancers.

Figure 8: Pitch-period estimation experiment. (a) Original speech signal; (b) energy estimate of prediction-error residual from cumulant-based filter; (c) pitch contour obtained by processing the energy-estimate sequence using the method in [18]; and (d) autocorrelation sequence of the second voiced speech segment processed by the method in [37], leading to a gross error.


3  Cumulant-Based  Blind  Optimum  Beamforming 

Array processing techniques play an important role in enhancement of signals in the presence of interference. A number of books, and an extensive literature [ 1 .'i.dO X2. 12. I I .">().G 1 .GM j have already been published. Capon's minimum-variance distortionless response (MVDR) beamformer [8] has been a starting point for both signal enhancement and high-resolution direction-of-arrival (DOA) estimation.

In recent years, there has been an increasing interest in high-resolution array processing techniques based on eigendecomposition of the covariance matrix of received signals [l.17.2G- 2 1 .d(>. ;{>.'*(>. (it) G I ,()'). 7 I]. To recover the signal of interest in the presence of interfering signals, the so-called "copy" function is used. In this procedure, DOAs for all signals are first estimated, and then the minimum-variance processor that reconstructs the desired signal and minimizes the contribution of all interference sources is implemented. All of the previously referenced methods rely on complete knowledge of responses and locations of array elements and/or DOA information of the desired signal.

If the array manifold is unknown, or there are uncertainties, it is then necessary to calibrate the array; however, this is not a practical thing to do, since calibration must be done quite frequently, and, each time, array-manifold information must be stored. In addition, calibration sources may be required. Even small errors in the calibration procedure may considerably degrade the performance. Sensitivity analyses of high-resolution methods and MVDR beamforming have been presented in [I 1.12.1 l.lh.21 2'..2'>. 70.7b].

In this study, we shall employ higher-order statistics of received signals to estimate the steering vector of the non-Gaussian desired signal in the presence of directional Gaussian interferers with unknown covariance structure. We assume no knowledge of the array manifold or of DOA information about the desired signal. The desired signal may be voiced speech, a sonar signal, a radar return or a communication signal. In our work, we specialize to the communications scenario, which requires the use of fourth-order cumulants. Following a mathematical formulation of the problem in Section 3.1, we describe blind estimation and optimum beamforming procedures in Section 3.2.

Any estimation procedure is subject to errors, as is our cumulant-based source steering vector estimation method. In theory, cumulants are blind to Gaussian noise; however, their estimates are corrupted by noise. In order to obtain satisfactory results, longer data lengths are necessary in cumulant-based signal processing. To alleviate the effects of estimation error in the beamforming step, we propose a more efficient estimation procedure that fully utilizes the data acquired by the array. We further suggest a method of combining cumulant and covariance information to yield better estimates. Then we employ a robust beamforming method based on artificial noise injection to combat mismatch in the source steering vector. We consider the estimation error as a mismatch and successfully apply the robust approach to our problem. These methods are presented in Section 3.3.

In a communications environment, multipath propagation almost always takes place. In this case, all eigendecomposition-based techniques and MVDR fail. Only in some specific array configurations is it possible to decorrelate incoming signals and then estimate their DOAs. We analyze the behavior of our cumulant-based approach in Section 3.4. We show that our proposed approach behaves as the optimum beamformer that maximizes the Signal to Interference plus Noise Ratio (SINR).

For real-time operation (a necessary requirement in communications applications) we propose an adaptive implementation of the cumulant-based beamformer in Section 3.5. We then present simulation experiments to indicate the performance of our approach in Section 3.6. Finally, we give our conclusions in Section 3.7.

3.1  Problem  Formulation 

We formulate our problem in a narrowband fashion. In array processing, a problem is classified as narrowband if the signal bandwidth is small compared to the reciprocal of the time required for the signal wavefront to propagate across the array. For a discussion on bandwidth, see [(i().t*3j.

In the formulation, lower and upper case italic letters are used to represent scalars, lower case bold font letters are used for vectors, and upper case bold font letters are used for matrices.

3.1.1 Signal Model

Consider an array of $M$ elements, with arbitrary sensor response characteristics and locations. Assume there are $J$ Gaussian interference signals $\{i_j(t),\ j = 1, 2, \ldots, J\}$, and a non-Gaussian desired signal $d(t)$, centered at frequency $\omega_0$. We assume sources are far away from the array so that a planar wavefront approximation is possible. The additive noise present is assumed to be Gaussian with unknown covariance. With these assumptions the received signal at the $k$th sensor can be expressed as

\[
r_k(t) = a_k(\theta_d)\, d(t) + \sum_{j=1}^{J} a_k(\theta_j)\, i_j(t) + n_k(t) \qquad (1)
\]

where,

• $\theta_x$ : the direction of arrival of the wavefront corresponding to emitter $x$.

• $a_k(\theta_x)$ : response of the $k$th sensor to the $x$th signal wavefront, including the phase factor associated with the travel time of the signal wavefront with respect to a reference point; without loss of generality, this point can be taken as the first sensor location.

• $d(t)$ : the desired non-Gaussian signal as received at sensor 1, with variance $\sigma_d^2$.

• $i_j(t)$ : the $j$th interferer waveform as received at sensor 1; interference signals are assumed to be independent of the desired signal, and they are Gaussian processes.

• $n_k(t)$ : the additive noise at the $k$th sensor.


Equation (1) can be rewritten in matrix notation as

\[
\mathbf{r}(t) =
\begin{bmatrix} r_1(t) \\ r_2(t) \\ \vdots \\ r_M(t) \end{bmatrix}
=
\begin{bmatrix} \mathbf{a}(\theta_d), & \mathbf{a}(\theta_1), & \cdots, & \mathbf{a}(\theta_J) \end{bmatrix}
\begin{bmatrix} d(t) \\ i_1(t) \\ \vdots \\ i_J(t) \end{bmatrix}
+
\begin{bmatrix} n_1(t) \\ n_2(t) \\ \vdots \\ n_M(t) \end{bmatrix}
\qquad (2)
\]

where $\mathbf{a}(\theta_x)$ represents the $M \times 1$ steering vector for the wavefront from emitter $x$, which can be expressed as

\[
\mathbf{a}(\theta_x) = \left[\, a_1(\theta_x),\ a_2(\theta_x),\ \ldots,\ a_M(\theta_x) \,\right]^T \qquad (3)
\]

We define the array manifold as the collection of steering vectors over all DOAs of interest. Alternative expressions for the received signal vector are

\[
\mathbf{r}(t) = \mathbf{A}\,\mathbf{z}(t) + \mathbf{n}(t) = \mathbf{a}(\theta_d)\, d(t) + \mathbf{A}_I\, \mathbf{i}(t) + \mathbf{n}(t) \qquad (4)
\]

In this last expression, we partitioned the $M \times (J+1)$ steering matrix $\mathbf{A}$ as

\[
\mathbf{A} = \left[\, \mathbf{a}(\theta_d),\ \mathbf{A}_I \,\right] \qquad (5)
\]

where the $M \times J$ matrix $\mathbf{A}_I$ is the steering matrix for interference sources.
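
One snapshot of this signal model can be simulated with a few lines of code. The sketch below is illustrative only: the text allows arbitrary, unknown sensor responses, whereas the example assumes a uniform linear array with half-wavelength spacing and ideal sensors purely to obtain concrete steering vectors; the function names, angles, a BPSK-like desired signal and the noise level are all hypothetical.

```python
import cmath
import math
import random

def steering_vector(theta_deg, num_sensors):
    """Assumed model: uniform linear array, half-wavelength spacing.

    The first sensor is the phase reference, so its response is 1."""
    phase = math.pi * math.sin(math.radians(theta_deg))
    return [cmath.exp(-1j * phase * k) for k in range(num_sensors)]

def received_snapshot(a_d, interference_columns, d, i, n):
    """r = a(theta_d) d + A_I i + n, with A_I given as a list of columns."""
    r = []
    for k in range(len(a_d)):
        value = a_d[k] * d + n[k]
        for j, column in enumerate(interference_columns):
            value += column[k] * i[j]
        r.append(value)
    return r

def complex_gauss(sigma):
    return complex(random.gauss(0.0, sigma), random.gauss(0.0, sigma))

random.seed(2)
M = 4
a_d = steering_vector(0.0, M)                        # desired source
A_I = [steering_vector(30.0, M), steering_vector(-45.0, M)]  # J = 2 interferers

d = random.choice([-1.0, 1.0])                       # non-Gaussian desired signal
i = [complex_gauss(1.0) for _ in A_I]                # Gaussian interferers
n = [complex_gauss(0.1) for _ in range(M)]           # additive Gaussian noise
snapshot = received_snapshot(a_d, A_I, d, i, n)
print(snapshot)
```

Setting the interference and noise to zero returns a scaled copy of the desired steering vector, which is a simple consistency check on the model.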

In this paper, we address the problem of optimum beamforming with an array of sensors whose responses and locations are completely unknown; hence, although we may have a priori knowledge about the direction of arrival of the desired signal, we cannot perform beamforming due to the lack of knowledge of the array manifold. This problem has been addressed previously; however, the existing algorithm is limited to a single interference signal. We investigate the possibility of a more general solution; namely, signal retrieval in the presence of multiple interferers whose correlation structure is unknown. Before presenting our approach, which employs higher-order statistics, we demonstrate the limitations of covariance-based array processing for this problem.

3.1.2 Covariance-Based Approaches

Currently used high-resolution methods of DOA estimation and minimum-variance distortionless response (MVDR) beamforming employ the covariance matrix of signals received by the array. The wavefront covariance matrix, $\mathbf{S}$, is defined as the covariance of the source signals as received at the reference point, i.e., at sensor 1:

\[
\mathbf{S} = E\{\mathbf{z}(t)\,\mathbf{z}^H(t)\} \qquad (6)
\]

where $(\cdot)^H$ denotes complex conjugate transpose. Using the received signal model in (4), we can express the $M \times M$ covariance matrix $\mathbf{R}$ of array measurements in the following two ways:

\[
\mathbf{R} = E\{\mathbf{r}(t)\,\mathbf{r}^H(t)\} = \mathbf{A}\,\mathbf{S}\,\mathbf{A}^H + \mathbf{R}_n = \sigma_d^2\, \mathbf{a}(\theta_d)\, \mathbf{a}^H(\theta_d) + \mathbf{R}_u \qquad (7)
\]

where $\mathbf{R}_n$ is the noise covariance matrix,

\[
\mathbf{R}_n = E\{\mathbf{n}(t)\,\mathbf{n}^H(t)\} \qquad (8)
\]

and $\mathbf{R}_u$ is the covariance matrix of the undesired signals, i.e.,

\[
\mathbf{R}_u = E\{\,[\mathbf{A}_I\, \mathbf{i}(t) + \mathbf{n}(t)]\,[\mathbf{A}_I\, \mathbf{i}(t) + \mathbf{n}(t)]^H\,\} \qquad (9)
\]

In general, the noise covariance matrix, $\mathbf{R}_n$, is unknown. With some restrictions on array orientation and noise covariance structure, some approaches for high-resolution DOA estimation are proposed in [47,52] that do not require this information; however, these techniques have their limitations due to the assumptions involved. Even with complete knowledge of noise covariance structure, source localization is still impossible without knowledge of the array manifold. In [56], the ESPRIT algorithm is devised to overcome this problem; however, ESPRIT requires translationally equivalent subarrays with known displacement vectors, which may also be impractical due to all the constraints on array orientation. In [21], an eigendecomposition-based beamforming approach is proposed which assumes the identifiability of the signal subspace and availability of the steering vector information for the signal of interest. Good results were obtained under these assumptions; however, this method cannot handle coherent interference and spatially colored noise.

In [9-10, 57], blind estimation of steering vectors for independent emitters is discussed with the
following conclusion:

Blind estimation of source steering vectors is not possible with only second-order
statistics, but by employing higher-than-second-order cumulants, it is possible to estimate
source steering vectors up to a scale factor.

MVDR beamforming is an alternate approach for signal recovery. This approach, however,
requires knowledge of the steering vector for the desired source up to a scale factor and uses the
covariance matrix $\mathbf{R}$ of received signals for processing. The output of the MVDR beamformer, $y(t)$,
can be expressed as [?]

$$y(t) = \mathbf{w}^H \mathbf{r}(t) = [\beta_1\,\mathbf{R}^{-1}\,\mathbf{a}(\theta_d)]^H\,\mathbf{r}(t) \qquad (10)$$

where the constant $\beta_1$ is present to maintain a specified response for the desired signal and $\mathbf{w}$
denotes the weight vector of the processor.

From the above expression, it is clear that MVDR beamforming requires knowledge of $\mathbf{a}(\theta_d)$.
Without knowledge of the array manifold, it is not possible to determine $\mathbf{a}(\theta_d)$ even in the case of known
$\theta_d$. Therefore, MVDR beamforming cannot be directly applied to our problem. In addition, the
MVDR beamformer is quite sensitive to errors in assumed sensor locations and characteristics
[11-12, 14, 20, 70, 76].

In many applications, multipath propagation takes place, resulting in coherent sources. Coherence
presents a serious problem to DOA methods; it leads to a singular source covariance matrix
$\mathbf{S}$, for which it is not possible to estimate source locations except in some specific array
configurations [48-49, 61-62, 66, 74, 75]. In the MVDR case, source coherency does not represent a problem as
long as there is no source correlated with the desired signal; however, this situation is rarely met
in practice. In general, the desired signal is subject to multipath propagation, and the performance
of the MVDR approach degrades severely [51, 78]. An optimum beamforming procedure has been
suggested in [6] to overcome the coherence problem by using a linear array of elements with identical
directional characteristics.

We are therefore looking for a method that can overcome all these problems. In the next section,
we present an approach that accomplishes this by combining cumulant-based blind estimation and
MVDR beamforming.

3.2  Cumulant-Based  Optimum  Beamforming 

In the previous section, we discussed the problem of optimum beamforming and concluded that it
is not possible to recover a desired signal in the presence of multiple interferers, unknown sensor
noise covariance, multipath propagation, and without any information about the array manifold. In
this section, we propose a method to overcome these problems. We propose a two-step procedure:
higher-order statistics for blind estimation of the source steering vector, followed by MVDR
beamforming based on second-order statistics of received signals and the steering vector estimate provided
by the first step.

3.2.1  Estimation  of  desired  signal  steering  vector 

In this section, we employ cumulants of received signals to estimate the steering vector of the
desired signal up to a constant factor. Third-order cumulants are blind to signals with symmetric
probability density functions. On the other hand, most signals in communication environments have
symmetric density functions, which motivates the use of fourth-order cumulants². First, we define
the fourth-order zero-lag cumulant operator of complex processes $\{x_1(t), x_2(t), x_3(t), x_4(t)\}$ as

$$\mathrm{cum}\{x_1(t), x_2(t), x_3(t), x_4(t)\} = E\{x_1(t)x_2(t)x_3(t)x_4(t)\} - E\{x_1(t)x_2(t)\}\,E\{x_3(t)x_4(t)\}$$
$$\qquad - E\{x_1(t)x_3(t)\}\,E\{x_2(t)x_4(t)\} - E\{x_1(t)x_4(t)\}\,E\{x_2(t)x_3(t)\} \qquad (11)$$

Next, consider the vector $\mathbf{c} = [c_1, c_2, \ldots, c_M]^T$, defined as

$$c_l = \mathrm{cum}\{r_1(t), r_1^*(t), r_1^*(t), r_l(t)\}, \qquad l = 1, 2, \ldots, M. \qquad (12)$$

As suggested in [?], there are various ways of defining fourth-order statistics of complex random
processes. We follow the approach presented in [?] in (12). Since the interference signals are
independent of the desired signal and they are Gaussian with zero fourth-order cumulants, we can express
$c_l$ as

$$c_l = \mathrm{cum}\{a_1(\theta_d)d(t),\ a_1^*(\theta_d)d^*(t),\ a_1^*(\theta_d)d^*(t),\ a_l(\theta_d)d(t)\} \qquad (13)$$

² [Footnote text illegible in the source.]

Using properties of cumulants, we obtain

$$c_l = |a_1(\theta_d)|^2\, a_1^*(\theta_d)\, a_l(\theta_d)\, \gamma_{4,d} \qquad (14)$$

where $\gamma_{4,d}$ denotes the zeroth lag of the fourth-order cumulant of the desired signal. Defining
$\beta_2 = |a_1(\theta_d)|^2\, a_1^*(\theta_d)\, \gamma_{4,d}$, we have the following expression for the $M \times 1$ vector $\mathbf{c}$:

$$\mathbf{c} = \beta_2\, \mathbf{a}(\theta_d) \qquad (15)$$

Observe that the vector $\mathbf{c}$ is a replica of the steering vector of the desired signal up to a scale factor.
We show in the next section how this information can be used to recover the desired signal.
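
The estimate in (12) can be sketched numerically as follows. This is only an illustration under stated assumptions, not the authors' implementation: the data are taken to be zero-mean, the array is assumed to be a half-wavelength uniform linear array, and the scenario (one BPSK desired signal, one Gaussian interferer, white Gaussian noise, all powers and angles) is invented for the example. The helper names `cum4`, `steering_estimate`, and `ula` are hypothetical.

```python
import numpy as np

def cum4(x1, x2, x3, x4):
    """Sample zero-lag fourth-order cumulant of zero-mean sequences, following (11)."""
    m2 = lambda u, v: np.mean(u * v)
    return (np.mean(x1 * x2 * x3 * x4)
            - m2(x1, x2) * m2(x3, x4)
            - m2(x1, x3) * m2(x2, x4)
            - m2(x1, x4) * m2(x2, x3))

def steering_estimate(r):
    """c_l = cum{r_1, r_1^*, r_1^*, r_l}, l = 1..M, for an M x N snapshot matrix r."""
    r1 = r[0]
    return np.array([cum4(r1, r1.conj(), r1.conj(), rl) for rl in r])

def ula(M, theta_deg):
    """Steering vector of an M-element half-wavelength ULA (assumed geometry)."""
    return np.exp(1j * np.pi * np.arange(M) * np.sin(np.deg2rad(theta_deg)))

# Hypothetical scenario: all values below are assumptions made for the example.
rng = np.random.default_rng(0)
M, N = 6, 100_000
a_d, a_i = ula(M, 5.0), ula(M, -40.0)
d = rng.choice([-1.0, 1.0], size=N)                                   # BPSK desired signal
i = 0.5 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))      # Gaussian interferer, power 0.5
n = np.sqrt(0.05) * (rng.standard_normal((M, N))
                     + 1j * rng.standard_normal((M, N)))              # white noise, power 0.1
r = np.outer(a_d, d) + np.outer(a_i, i) + n

c = steering_estimate(r)
# |cosine| of the angle between c and the true steering vector; near 1 when c ~ beta * a_d.
alignment = abs(np.vdot(a_d, c)) / (np.linalg.norm(a_d) * np.linalg.norm(c))
```

Because the interferer and the noise are Gaussian, their fourth-order cumulants vanish asymptotically, so `c` should line up with `a_d` even though neither the interferer direction nor the noise power is supplied to the estimator.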

3.2.2  Interference  Rejection 

With the knowledge of the steering vector of the desired signal, interference rejection is possible
using the following minimum-variance distortionless response formulation: find the weight vector
$\mathbf{w}$ that minimizes the power, $\mathbf{w}^H \mathbf{R}\, \mathbf{w}$, at the output of the beamformer subject to the constraint
$\mathbf{w}^H \mathbf{c} = 1$, where $\mathbf{c}$ is obtained via the cumulant-based estimation procedure described in
Subsection 3.2.1. The solution to this optimization problem is well-known [?], and can be expressed as

$$\mathbf{w} = \beta_3\, \mathbf{R}^{-1}\, \mathbf{c} \qquad (16)$$

where the constant $\beta_3 = (\mathbf{c}^H \mathbf{R}^{-1} \mathbf{c})^{-1}$ is present in order to maintain the linear constraint.

Due to the constraint $\mathbf{w}^H \mathbf{c} = 1$, the power minimization procedure does not cancel the desired
signal, but rejects all interference components and sensor noise in the best possible manner. Note
that this is accomplished without knowledge of the covariance structure of interference signals, sensor
noise, or the array manifold. In the sequel, we refer to the processor in (16) as CUM1. The proof
that this cumulant-based beamformer is identical to the maximum SINR processor is provided in
Section 3.4, where the general multipath case is treated.
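
A minimal numerical sketch of (16) is given below. It assumes only that a Hermitian positive-definite covariance estimate `R` and a steering vector estimate `c` are available; the function name `mvdr_weights` and the random test data are hypothetical and stand in for real array measurements.

```python
import numpy as np

def mvdr_weights(R, c):
    """w = R^{-1} c / (c^H R^{-1} c): minimizes w^H R w subject to w^H c = 1."""
    Rinv_c = np.linalg.solve(R, c)          # avoids forming R^{-1} explicitly
    return Rinv_c / np.vdot(c, Rinv_c)      # np.vdot conjugates its first argument

# Hypothetical data: a random positive-definite R and a random c.
rng = np.random.default_rng(1)
M = 5
X = rng.standard_normal((M, 50)) + 1j * rng.standard_normal((M, 50))
R = X @ X.conj().T / 50
c = rng.standard_normal(M) + 1j * rng.standard_normal(M)

w = mvdr_weights(R, c)
constraint = np.vdot(w, c)                  # w^H c
power = np.vdot(w, R @ w).real              # output power w^H R w
```

Any other weight vector that satisfies the same constraint, for example `w + q` with `q` orthogonal to `c`, yields an output power no smaller than `power`.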

3.3  Robust  Beamforming 

In this section, we first propose an approach that utilizes the received data in the estimation of
the source steering vector in a more efficient manner. We then suggest a method that uses both
cumulant and covariance information under some scenarios. Finally, we employ a robust method
to combat the effects of estimation errors.

3.3.1 Efficient Utilization of Array Data

In the previous section, we presented a method of blind estimation of the desired source steering
vector from the received data; however, the proposed approach is rather inefficient in the sense that
only the first sensor is taken as reference. For example, if the connection from this element to the
processor is broken, then the estimation objective cannot be accomplished. Similarly, due to poor
receiving circuitry following this array element, the reference signal may be very noisy, degrading
the quality of the estimate. We can overcome these difficulties by using multiple reference elements.
Define the matrix $\mathbf{C}$ with the $(l, k)$th element,

$$C_{l,k} = \mathrm{cum}\{r_k(t), r_k^*(t), r_k^*(t), r_l(t)\} \qquad \text{where } k, l = 1, \ldots, M, \qquad (17)$$

so that column $k$ is the estimate (12) computed with sensor $k$ as the reference.

With true statistics, the cross-cumulant matrix $\mathbf{C}$ will have rank 1, since all its columns are scaled
replicas of the desired source steering vector; however, with sample statistics this condition never
holds. The left singular vector of $\mathbf{C}$ with the largest singular value can be used as the estimate of
the desired source steering vector, reducing the effects of noise. In this way, we utilize array data
more efficiently⁴. The beamformer that employs the steering vector estimate obtained in the way
described above is referred to as the CUM2 beamformer in the sequel.
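
The multiple-reference estimate can be sketched as below. The sketch assumes the index convention in which each column of the cross-cumulant matrix uses one sensor as the reference, so that the left singular vector is the one aligned with the steering vector; the function names, the half-wavelength uniform linear array, and the simulated BPSK-plus-noise data are all assumptions made for the example.

```python
import numpy as np

def cross_cumulant_matrix(r):
    """Column k holds cum{r_k, r_k^*, r_k^*, r_l}, l = 1..M (sensor k as reference)."""
    M = r.shape[0]
    C = np.empty((M, M), dtype=complex)
    for k in range(M):
        rk = r[k]
        m4 = np.mean(r * (np.abs(rk) ** 2 * rk.conj()), axis=1)   # E{r_k r_k^* r_k^* r_l}
        p = np.mean(np.abs(rk) ** 2)                              # E{r_k r_k^*}
        q = np.mean(r * rk.conj(), axis=1)                        # E{r_k^* r_l}
        v = np.mean(rk.conj() ** 2)                               # E{r_k^* r_k^*}
        x = np.mean(r * rk, axis=1)                               # E{r_k r_l}
        C[:, k] = m4 - 2 * p * q - v * x
    return C

def cum2_steering_estimate(r):
    """Left singular vector of C for the largest singular value, plus all singular values."""
    u, s, _ = np.linalg.svd(cross_cumulant_matrix(r))
    return u[:, 0], s

# Hypothetical scenario: BPSK source at 5 degrees, white noise of power 0.5.
rng = np.random.default_rng(5)
M, N = 6, 50_000
a_d = np.exp(1j * np.pi * np.arange(M) * np.sin(np.deg2rad(5.0)))
d = rng.choice([-1.0, 1.0], size=N)
n = 0.5 * (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N)))
r = np.outer(a_d, d) + n

c_hat, s = cum2_steering_estimate(r)
alignment = abs(np.vdot(a_d, c_hat)) / np.linalg.norm(a_d)   # c_hat already has unit norm
```

With sample statistics the matrix is only approximately rank 1; the gap between `s[0]` and `s[1]` gives a rough indication of how well the rank-1 model holds for a given record length.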

In addition, the Total Least Squares algorithm, which takes the errors in both the received data
covariance matrix estimate and the steering vector estimate into account, is a better choice for
computing the optimum weight vector, as suggested in [78], but it is computationally expensive.
If extra computations are feasible, we suggest the use of the Constrained Total Least Squares
algorithm [1] for even better numerical results.

3.3.2 Covariance-Cumulant (C²) Approach

In some array processing applications, the sensor noise covariance has a definite structure,
enabling a whitening operation on the received data. The principal eigenvectors of the covariance
matrix of this processed data reveal the subspace spanned by the steering vectors of directional
signals illuminating the array [58]. Hence, the steering vector estimate obtained by the
cumulant-based approach can be improved by projecting this estimate onto the subspace spanned by the
principal eigenvectors of the covariance matrix. This improved estimate can then be used in the
beamforming procedure of Section 3.2.2. The motivation behind this approach is that covariance
estimates exhibit less variance than cumulant estimates, but in the covariance domain we cannot
identify the source steering vector if there are multiple sources. This procedure yields an estimate of
the steering vector from covariance-matrix information by employing the cumulant-based estimate
as side information. A mathematical description of this approach is presented below:

1. From the received data, estimate the covariance matrix $\mathbf{R}$ and the desired signal steering
vector $\mathbf{c}$ by the cumulant-based procedure.

2. Perform an eigendecomposition of the sample covariance matrix to reveal the signal and
noise subspaces: the eigenvectors of $\mathbf{R}$ with the repeated minimum eigenvalue span the noise
subspace [58], while the rest span the signal subspace.

3. Assume the signal subspace is $(J + 1)$-dimensional. Then the basis vectors for the signal
subspace, obtained from the eigendecomposition procedure, can be stored in an $M \times (J + 1)$
matrix $\mathbf{E}_s$, with the column space identical to the signal subspace.

4. Project the cumulant-based steering vector estimate $\mathbf{c}$ onto the signal subspace to obtain an
improved estimate as

$$\mathbf{c}_{proj} = \mathbf{E}_s \mathbf{E}_s^H\, \mathbf{c}$$

5. Compute the weights for the beamformer as

$$\mathbf{w} = \mathbf{R}^{-1}\, \mathbf{c}_{proj}$$

⁴ A method that utilizes the array data even more efficiently is presented in [19].
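
The projection steps listed above can be sketched as follows. The sketch assumes spatially white noise (so no whitening step is needed) and a known number of directional signals; the names `c2_weights`, `n_sources`, and `ula`, and the two-source scenario with exact covariance, are hypothetical choices made for illustration.

```python
import numpy as np

def ula(M, theta_deg):
    """Steering vector of an M-element half-wavelength ULA (assumed geometry)."""
    return np.exp(1j * np.pi * np.arange(M) * np.sin(np.deg2rad(theta_deg)))

def c2_weights(R, c, n_sources):
    """Project c onto the principal eigenvectors of R, then solve R w = c_proj."""
    _, eigvecs = np.linalg.eigh(R)               # eigenvalues returned in ascending order
    Es = eigvecs[:, -n_sources:]                 # M x (J+1) signal-subspace basis
    c_proj = Es @ (Es.conj().T @ c)              # orthogonal projection of c
    return np.linalg.solve(R, c_proj), c_proj

# Hypothetical scenario: desired signal at 0 degrees, one interferer at 30 degrees.
M = 8
a_d, a_i = ula(M, 0.0), ula(M, 30.0)
R = 10 * np.outer(a_d, a_d.conj()) + 5 * np.outer(a_i, a_i.conj()) + np.eye(M)

rng = np.random.default_rng(4)
c = a_d + 0.3 * (rng.standard_normal(M) + 1j * rng.standard_normal(M))   # noisy estimate of a_d

w, c_proj = c2_weights(R, c, n_sources=2)
err_before = np.linalg.norm(c - a_d)
err_after = np.linalg.norm(c_proj - a_d)
```

Since the true steering vector lies in the signal subspace, the projection removes the component of the estimation error that falls in the noise subspace, so `err_after` cannot exceed `err_before`.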
3.3.3 Robustness Constraint

Any estimation procedure is inevitably subject to errors. MVDR beamforming is extremely sensitive
to mismatch [11-12, 14, 10, 29, 70, 76], especially in high-SNR conditions and in arrays with a large
number of elements. A variety of constraints have been summarized in [68] assuming perfect
knowledge of element characteristics and locations; however, in our case these methods are not
applicable since there is no available information about the array manifold to design effective
constraints.

Errors  in  the  steering  vector  estimate  result  in  signal  cancellation.  This  mismatch  condition, 
arising  from  non-perfect  estimation,  can  be  viewed  as  the  problem  of  optimum  beamforming  with 
an  array  of  sensors  at  slightly  perturbed  locations.  In  [15],  a  method  that  constrains  the  white 
noise  gain  of  the  processor  is  proposed  for  the  solution  of  the  latter  problem.  In  this  section,  we 
use  the  same  approach  to  alleviate  the  effects  of  estimation  errors  in  cumulant-based  optimum 
beamforming. 

In order to understand the mismatch problem and find a way to alleviate its effects, we need
to examine it analytically. Consider the power response of a beamformer with a weight
vector $\mathbf{w}$, as a function of DOA $\theta$, defined as

$$P(\theta) = |\mathbf{w}^H \mathbf{a}(\theta)|^2 \qquad (18)$$

with $\mathbf{a}(\theta)$ denoting the steering vector for an arrival from $\theta$. The derivative, $\partial P(\theta)/\partial\theta$, can be
expressed as

$$\frac{\partial P(\theta)}{\partial \theta} = 2\,\mathrm{Re}\left\{ \left(\mathbf{w}^H \mathbf{a}(\theta)\right)^* \sum_{m=1}^{M} w_m^*\, \frac{\partial a_m(\theta)}{\partial \theta} \right\} \qquad (19)$$

Now consider the following scenario: we have an MVDR processor looking at $\theta_0$, which is the
expected DOA for the desired signal. Instead, the source illuminates the array from $\theta_d$, which is
very close but not equal to $\theta_0$. In this case, the beamformer treats the desired signal as interference
and nulls it; however, due to the distortionless response constraint for $\theta_0$, and since the angles are
very close, the derivative $\partial P(\theta)/\partial\theta$ must be large in magnitude for $\theta$ between $\theta_d$ and $\theta_0$. From
the derivative expression (19), it is clear that this is possible only if the norm of the weight vector
increases, since the inner product, $\mathbf{w}^H \mathbf{a}(\theta)$, and the derivatives, $\{\partial a_m(\theta)/\partial\theta\}_{m=1}^{M}$, are bounded. In
this situation, the constraint is maintained by increasing the angle between the weight vector and
the look-direction steering vector. This phenomenon was exploited in [77] for tuning the beamformer
to acquire a weak desired signal in the presence of strong interference.

Note that the white-noise amplification factor for any processor with a weight vector $\mathbf{w}$ is $\mathbf{w}^H \mathbf{w}$;
hence, the nulling phenomenon can be prevented if the white noise level at the processor is sufficiently
high so that the output power minimization criterion limits the increase in the norm of $\mathbf{w}$. This can be
achieved by perturbing the covariance matrix estimate of array measurements by a scaled identity
matrix as

$$\mathbf{R}_\epsilon = \mathbf{R} + \epsilon\, \mathbf{I} \qquad (20)$$

where $\epsilon$ is a non-negative parameter which adjusts the strength of perturbation. Alternatively, it
is possible to coin a term virtual SNR, $\mathrm{SNR}_v$, defined as

$$\mathrm{SNR}_v = \mathrm{SNR} - 10 \log_{10}\left( \frac{\sigma_n^2 + \epsilon}{\sigma_n^2} \right) \qquad (21)$$

We then determine the weight vector

$$\mathbf{w} = \mathbf{R}_\epsilon^{-1}\, \mathbf{a}(\theta_0) \qquad (22)$$

A recent method presented in [?] performs this procedure in an adaptive fashion by a simple
scaling of the weight vector. In our case, we do not have source DOA information, but we do have
an estimate of the steering vector. It is therefore possible to use this estimate in place of $\mathbf{a}(\theta_0)$ in
(22) to formulate the cumulant-based processor with limited signal nulling property.
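
The effect of the perturbation in (20) can be illustrated with a small numerical sketch. It assumes exact (not sample) statistics, a 10-element half-wavelength uniform linear array, a 20 dB source arriving 5 degrees away from the look direction, and unit-power white noise; the loading level `eps = 100` and the names `ula` and `loaded_mvdr` are arbitrary choices for the example. Unlike (22), the weights here are additionally scaled to unit response in the look direction so that the two processors can be compared.

```python
import numpy as np

def ula(M, theta_deg):
    """Steering vector of an M-element half-wavelength ULA (assumed geometry)."""
    return np.exp(1j * np.pi * np.arange(M) * np.sin(np.deg2rad(theta_deg)))

def loaded_mvdr(R, a, eps):
    """Weights from R_eps = R + eps * I, scaled so that w^H a = 1."""
    w = np.linalg.solve(R + eps * np.eye(R.shape[0]), a)
    return w / np.vdot(a, w)

M = 10
a0, ad = ula(M, 0.0), ula(M, 5.0)                 # look direction vs. actual arrival
R = 100.0 * np.outer(ad, ad.conj()) + np.eye(M)   # 20 dB source plus unit white noise

w_plain = loaded_mvdr(R, a0, 0.0)                 # ordinary MVDR under mismatch
w_loaded = loaded_mvdr(R, a0, 100.0)              # diagonally loaded version

resp_plain = abs(np.vdot(w_plain, ad))            # |w^H a(theta_d)| without loading
resp_loaded = abs(np.vdot(w_loaded, ad))          # |w^H a(theta_d)| with loading
wng_plain = 1.0 / np.linalg.norm(w_plain) ** 2    # white-noise gain for unit look response
wng_loaded = 1.0 / np.linalg.norm(w_loaded) ** 2
```

In this setting the unloaded processor places a deep null on the actual arrival direction, while the loaded one keeps a much larger response there at the cost of less aggressive interference suppression; choosing `eps` in practice is a trade-off that this sketch does not address.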

3.4  Multipath  Phenomena 

Eigendecomposition-based high-resolution methods [1, 17, 26-27, 36, 38, 56, 60-61, 69, 71] have proven
to be effective means of obtaining bearing estimates of far-field narrowband sources from noisy
measurements. The performance of these algorithms is severely degraded when coherence is present.
Several methods have been proposed to solve the coherent signals problem with restrictions on
array geometry [48-49, 61-62, 66, 74, 75]; however, without knowledge of the array manifold it is not
possible to solve the coherence problem. MVDR beamforming also fails to perform optimally
when interference signals are correlated with the desired signal [51, 78]. In some scenarios, even
the conventional beamformer outperforms the MVDR approach due to signal cancellation in the
MVDR beamformer.

In Section 3.2, we showed that the cumulant-based beamformer is not affected by the presence
of coherence among interfering Gaussian signals as long as they are not correlated with the desired
signal. The same is not possible for high-resolution DOA estimation methods; but, the MVDR
beamformer may perform equally well if the desired signal steering vector is known and a satisfactory
estimate of $\mathbf{R}$ is available. In this section, we show that the cumulant-based approach is not affected
by the presence of multipath propagation of the desired signal. In addition, we show that the
cumulant-based processor turns out to be the maximal-ratio-combiner [?] that maximizes the SINR.

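With the presence of multipath propagation or smart jamming, our signal model in (4) changes
to

$$r_k(t) = d(t) \sum_{l=1}^{L} g_l\, a_k(\theta_{d_l}) + \sum_{j=1}^{J} a_k(\theta_j)\, i_j(t) + n_k(t) \qquad (23)$$

or in vector form

$$\mathbf{r}(t) = [\mathbf{a}(\theta_{d_1}), \mathbf{a}(\theta_{d_2}), \ldots, \mathbf{a}(\theta_{d_L})]\, \mathbf{g}\, d(t) + \mathbf{A}_I\, \mathbf{i}(t) + \mathbf{n}(t) \qquad (24)$$

where the set of scalars $\{g_1, g_2, \ldots, g_L\}$, collected in the vector $\mathbf{g}$, constitute the multipath
coefficients for an $L$-ray scenario. The set of vectors $\{\mathbf{a}(\theta_{d_1}), \ldots, \mathbf{a}(\theta_{d_L})\}$ are the corresponding
steering vectors of the $L$-ray model. Letting

$$\mathbf{b} = [\mathbf{a}(\theta_{d_1}), \mathbf{a}(\theta_{d_2}), \ldots, \mathbf{a}(\theta_{d_L})]\, \mathbf{g} = \mathbf{A}_d\, \mathbf{g} \qquad (25)$$

we can reduce the signal model for multipath phenomena to the single-ray propagation model of
Section 3.1.1,

$$\mathbf{r}(t) = \mathbf{b}\, d(t) + \mathbf{A}_I\, \mathbf{i}(t) + \mathbf{n}(t) \qquad (26)$$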

because we can view the vector $\mathbf{b}$ as a generalized steering vector for a single desired signal, although
it may not be a vector in the array manifold. Therefore, following our work in Section 3.2, the
cumulant-based blind estimation procedure will yield

$$\mathbf{c} = \beta_4\, \mathbf{b} \qquad (27)$$

where $\beta_4 = |b_1|^2\, b_1^*\, \gamma_{4,d}$, in which $b_1$ is the first component of $\mathbf{b}$. Incorporating (27) into the
constrained power minimization procedure, we obtain the following weight vector,

$$\mathbf{w}_{cum} = \beta_5\, \mathbf{R}^{-1}\, \mathbf{c} = \beta_4 \beta_5\, \mathbf{R}^{-1}\, \mathbf{b} \qquad (28)$$

where $\beta_5 = (\mathbf{c}^H \mathbf{R}^{-1} \mathbf{c})^{-1}$.

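Next, we find an alternate expression for $\mathbf{w}_{cum}$. Recall that the optimization problem which
results in $\mathbf{w}_{cum}$ is: minimize $\mathbf{w}^H \mathbf{R} \mathbf{w}$ subject to $\mathbf{w}^H \mathbf{c} = 1$, or by (27), $\mathbf{w}^H \mathbf{b} = 1/\beta_4$. We can express
the output power in the following way by using (9) and (26),

$$\mathbf{w}^H \mathbf{R} \mathbf{w} = \sigma_d^2\, |\mathbf{w}^H \mathbf{b}|^2 + \mathbf{w}^H \mathbf{R}_u \mathbf{w} \qquad (29)$$

but, due to the constraint $\mathbf{w}^H \mathbf{b} = 1/\beta_4$, the first term in the above expression is a constant.
Therefore, the original optimization problem can be translated into: minimize $\mathbf{w}^H \mathbf{R}_u \mathbf{w}$, subject
to $\mathbf{w}^H \mathbf{c} = 1$ or equivalently, $\mathbf{w}^H \mathbf{b} = 1/\beta_4$. The solution to this problem is

$$\mathbf{w}_{cum} = \beta_6\, \mathbf{R}_u^{-1}\, \mathbf{c} \qquad (30)$$

where $\beta_6 = (\mathbf{c}^H \mathbf{R}_u^{-1} \mathbf{c})^{-1}$. Of course, this solution can also be expressed in terms of $\mathbf{b}$, as

$$\mathbf{w}_{cum} = \beta_7\, \mathbf{R}_u^{-1}\, \mathbf{b} \qquad (31)$$

where $\beta_7 = \beta_4 \beta_6$.

Note that although (30) and (31) are alternate expressions for $\mathbf{w}_{cum}$, they are not the way to
actually compute it, since $\mathbf{R}_u$ is not available in general.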

Next, we determine the weight vector that yields the maximum SINR. SINR can be expressed
as a function of the weight vector of the beamformer, as

$$\mathrm{SINR}(\mathbf{w}) = \sigma_d^2\, \frac{\mathbf{w}^H \mathbf{b}\, \mathbf{b}^H \mathbf{w}}{\mathbf{w}^H \mathbf{R}_u \mathbf{w}} \qquad (32)$$

Defining $\mathbf{v} = \mathbf{R}_u^{1/2} \mathbf{w}$ so that $\mathbf{w} = \mathbf{R}_u^{-1/2} \mathbf{v}$, we can reexpress (32) as

$$\mathrm{SINR}(\mathbf{w}) = \mathrm{SINR}(\mathbf{R}_u^{-1/2} \mathbf{v}) = \sigma_d^2\, \frac{|\mathbf{v}^H \mathbf{R}_u^{-1/2} \mathbf{b}|^2}{\mathbf{v}^H \mathbf{v}} \qquad (33)$$

Applying the Schwarz inequality [60] to (33), we find that

$$\mathrm{SINR}(\mathbf{w}) = \mathrm{SINR}(\mathbf{R}_u^{-1/2} \mathbf{v}) \le \sigma_d^2\, \|\mathbf{R}_u^{-1/2} \mathbf{b}\|^2 = \sigma_d^2\, \mathbf{b}^H \mathbf{R}_u^{-1} \mathbf{b} \qquad (34)$$

where equality holds if and only if

$$\mathbf{v} = \beta_8\, \mathbf{R}_u^{-1/2}\, \mathbf{b} \qquad (35)$$

in which $\beta_8$ is a non-zero constant. Consequently, the optimum weight vector $\mathbf{w}_{sinr}$, which yields
the maximum SINR, can be determined from $\mathbf{w} = \mathbf{R}_u^{-1/2} \mathbf{v}$ and (35), as

$$\mathbf{w}_{sinr} = \beta_8\, \mathbf{R}_u^{-1}\, \mathbf{b} \qquad (36)$$

Based on this derivation, some comments are in order. It is clear, by comparing (31) and (36),
that the cumulant-based beamformer does indeed yield the maximum possible SINR, since $\mathbf{w}_{cum}$
is just a scaled version of $\mathbf{w}_{sinr}$. This observation proves that the cumulant-based beamformer is
optimal. In addition, $\mathbf{w}_{cum}$ can be computed from the received data, whereas $\mathbf{w}_{sinr}$, as
implemented in (36), requires knowledge of $\mathbf{R}_u$, which cannot be determined from the received data in
the presence of the desired signal. Finally, note that the robust approaches presented in Section 3.3
are directly applicable in the presence of multipath.
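
The equivalence between the directions of (28) and (36) can be checked numerically. The sketch below uses exact statistics and made-up quantities: the covariance of the undesired components, the random generalized steering vector `b`, and the desired-signal power `sigma_d2` are all assumptions chosen only so that the example is self-contained.

```python
import numpy as np

rng = np.random.default_rng(2)
M, sigma_d2 = 6, 4.0
idx = np.arange(M)
R_u = 0.8 ** np.abs(idx[:, None] - idx[None, :])            # example undesired-signal covariance
b = rng.standard_normal(M) + 1j * rng.standard_normal(M)    # generalized steering vector
R = sigma_d2 * np.outer(b, b.conj()) + R_u                  # total covariance of the array data

def sinr(w):
    """SINR(w) = sigma_d^2 |w^H b|^2 / (w^H R_u w), as in (32)."""
    return sigma_d2 * abs(np.vdot(w, b)) ** 2 / np.vdot(w, R_u @ w).real

w_cum = np.linalg.solve(R, b)          # direction of (28): needs only R
w_sinr = np.linalg.solve(R_u, b)       # direction of (36): needs R_u
bound = sigma_d2 * np.vdot(b, w_sinr).real   # sigma_d^2 b^H R_u^{-1} b, the bound in (34)

# |cosine| of the angle between the two weight vectors; equals 1 iff they are parallel.
cosine = abs(np.vdot(w_cum, w_sinr)) / (np.linalg.norm(w_cum) * np.linalg.norm(w_sinr))
```

The two solutions differ only by a scalar, which also follows from the matrix inversion lemma applied to `R`; since the SINR in (32) is invariant to scaling of the weights, both attain the bound.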


3.5  Adaptive  Processing 

In real-world applications, adaptive beamforming is an important requirement, especially when the
desired signal source is in relative motion with respect to the array. In this section, we address this
problem by providing an "estimate and plug" type of adaptive algorithm for the CUM1 method.

The beamforming procedure (16) requires the inverse of the sample covariance matrix to
compute the weights. We can estimate the covariance matrix recursively, as

$$\mathbf{R}_t = (1 - \alpha_1)\, \mathbf{R}_{t-1} + \alpha_1\, \mathbf{r}(t)\, \mathbf{r}^H(t) \qquad (37)$$

Since we need to propagate the inverse of $\mathbf{R}_t$, we use the Sherman-Morrison formula [46] to obtain

$$\mathbf{R}_t^{-1} = \frac{1}{1 - \alpha_1} \left[ \mathbf{R}_{t-1}^{-1} - \alpha_1\, \frac{\mathbf{R}_{t-1}^{-1}\, \mathbf{r}(t)\, \mathbf{r}^H(t)\, \mathbf{R}_{t-1}^{-1}}{(1 - \alpha_1) + \alpha_1\, \mathbf{r}^H(t)\, \mathbf{R}_{t-1}^{-1}\, \mathbf{r}(t)} \right], \qquad t = 1, 2, \ldots \qquad (38)$$

with $\mathbf{R}_0^{-1} = \delta\, \mathbf{I}$, where $\delta$ is a large positive number and $\alpha_1$ controls the learning rate for second-order
statistics.

To compute the weight vector, we also need the cumulant-based estimate of the source steering
vector $\mathbf{c}$. We can estimate it recursively as

$$c_l(t) = (1 - \alpha_2)\, c_l(t-1) + \alpha_2 \left[ |r_1(t)|^2\, r_1^*(t)\, r_l(t) - 2\, p(t)\, q_l(t) - v(t)\, x_l(t) \right] \qquad (39)$$

with the auxiliary processes defined as

$$p(t) = (1 - \alpha_3)\, p(t-1) + \alpha_3\, |r_1(t)|^2$$
$$q_l(t) = (1 - \alpha_3)\, q_l(t-1) + \alpha_3\, r_1^*(t)\, r_l(t)$$
$$v(t) = (1 - \alpha_3)\, v(t-1) + \alpha_3\, (r_1^*(t))^2$$
$$x_l(t) = (1 - \alpha_3)\, x_l(t-1) + \alpha_3\, r_1(t)\, r_l(t)$$

The auxiliary processes are required in order to implement the cross-correlation terms in (11). The
initial values for the auxiliary processes can be set to zero. Different learning rates are provided
to emphasize the fact that higher-order statistics require longer periods to acquire the required
information.

We can perform adaptive beamforming by computing the weight vector at each time as

$$\mathbf{w}(t) = \mathbf{R}_t^{-1}\, \mathbf{c}(t) \qquad (40)$$

and obtain the array output, as

$$y(t) = \mathbf{w}^H(t)\, \mathbf{r}(t). \qquad (41)$$

Adaptive versions of the CUM2 and C² methods will appear in a later publication.
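
The recursive inverse update can be sketched as below and compared against a direct inversion. Only the second-order part of the adaptive algorithm is shown; the function name `update_inverse`, the learning rate, the initialization constant, and the random snapshots are assumptions for the example, and the steering vector estimate needed for the weights would have to be supplied by a separate recursion.

```python
import numpy as np

def update_inverse(Rinv_prev, r, alpha):
    """One Sherman-Morrison step: inverse of (1 - alpha) * R_prev + alpha * r r^H."""
    g = Rinv_prev @ r                                    # R_{t-1}^{-1} r(t)
    denom = (1.0 - alpha) + alpha * np.vdot(r, g).real   # r^H R^{-1} r is real for Hermitian R
    return (Rinv_prev - alpha * np.outer(g, g.conj()) / denom) / (1.0 - alpha)

# Hypothetical run: M = 4 sensors, 200 random snapshots.
rng = np.random.default_rng(6)
M, T, alpha, delta = 4, 200, 0.02, 100.0
R = np.eye(M, dtype=complex) / delta       # R_0, chosen so that its inverse is delta * I
Rinv = delta * np.eye(M, dtype=complex)    # R_0^{-1}

for _ in range(T):
    r = rng.standard_normal(M) + 1j * rng.standard_normal(M)
    R = (1.0 - alpha) * R + alpha * np.outer(r, r.conj())   # direct recursion on R
    Rinv = update_inverse(Rinv, r, alpha)                   # recursion on the inverse

mismatch = np.linalg.norm(Rinv @ R - np.eye(M))   # should be near machine precision
```

Each step costs on the order of M squared operations instead of the M cubed needed for a fresh inversion, which is the reason for propagating the inverse directly.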


3.6  Simulations 

In this section we present various experiments to illustrate the performance of cumulant-based
beamforming. In all of the experiments we employed a uniformly spaced linear array, rather than an
arbitrary geometry. This is done for two reasons. First, covariance-based techniques are mainly designed
for this type of array structure, e.g., the spatial smoothing algorithm [48-49, 61-62, 66, 74, 75], so that
it will be possible to compare both previous and future work with our current results. In addition,
allowing a sufficient number of multipath rays, it is possible to represent any arbitrary steering
vector by the linear array, since the steering vectors of the uniformly spaced isotropic linear array
exhibit Vandermonde structure, resulting in linearly independent vectors for different DOAs. In
all batch-type experiments, the record length is 1000 snapshots and the array has 10 isotropic
elements with uniform half-wavelength spacing.
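
The Vandermonde argument can be made concrete with a short sketch: for a 10-element half-wavelength array, ten rays from distinct directions give a nonsingular steering matrix, so multipath coefficients can be found that reproduce any target vector. The particular ray directions and the random target vector are assumptions made for the example.

```python
import numpy as np

M = 10
sines = np.linspace(-0.9, 0.9, M)                          # sin(theta) of M distinct ray directions (assumed)
A = np.exp(1j * np.pi * np.outer(np.arange(M), sines))     # columns: ULA steering vectors (Vandermonde)

rng = np.random.default_rng(3)
b = rng.standard_normal(M) + 1j * rng.standard_normal(M)   # arbitrary "generalized steering vector"
g = np.linalg.solve(A, b)                                  # multipath coefficients with A g = b

rank = np.linalg.matrix_rank(A)
residual = np.linalg.norm(A @ g - b)
```

With directions this evenly spread in sin(theta) the steering matrix is very well conditioned; closely spaced rays would still give linearly independent columns in exact arithmetic, but the coefficients would become numerically sensitive.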

3.6.1  Experiment  1:  Desired  Signal  in  White-Noise 

In this experiment, we employ the linear array described above for optimum reception of a BPSK
signal, which is expected to arrive from broadside in the presence of temporally and spatially white,
equal-power, circularly symmetric sensor noise; however, the desired source illuminates the array
from −5° relative to broadside.

Our first MVDR beamformer, MVDR1, looks to broadside, i.e., a mismatch condition. Our
second MVDR beamformer, MVDR2, uses exact knowledge of the DOA of the desired signal. We also
employ the cumulant-based beamformer of Section 3.2, CUM1, and the improved cumulant-based
beamformer CUM2 of Section 3.3.1. We investigate the performance of these processors for the
following two elemental SNR levels: 20 dB for a strong signal and 0 dB for a weak signal. Note
that the white-noise gain of any processor is limited to 10 dB by the number of sensors [15].

The beampattern responses (18) and white-noise gains of these beamformers are presented
in Fig. 9 for SNR = 20 dB. All responses are normalized to have a maximum value of 0 dB. For
comparison purposes, the optimum beamformer response, calculated by using true statistics in (16),
is presented as the dashed curves. Observe that due to the mismatch condition, MVDR1 nulls the
desired signal. More interestingly, the MVDR2 processor that utilizes the true DOA information
does not improve the SNR, due to the mismatch arising from the use of a sample-data covariance
matrix. The cumulant-based processors, CUM1 and CUM2, yield excellent performance without
any knowledge of source DOA. It is very important to observe that the performance of
cumulant-based processors is better than that of the MVDR with exactly known look-direction.

We performed 100 Monte-Carlo runs to investigate the performance in a better way. The results
are given in Table 1.
Figure 9: Beampatterns and white-noise gains of processors in a single realization for SNR =
20 dB: (a) MVDR1, (b) MVDR2, (c) CUM1, (d) CUM2. The optimum pattern is illustrated
in dashed lines for comparison purposes.

Figure 10: Beampatterns and white-noise gains of processors in a single realization for SNR
= 0 dB: (a) MVDR1, (b) MVDR2, (c) CUM1, (d) CUM2. The optimum pattern is illustrated
in dashed lines for comparison purposes.

Figure 11: Power of cumulant-based beamforming: (a) received signal at the reference
element at SNR = 0 dB, (b) output of CUM2 processor.

From these results, it is clear that cumulant-based processors are superior and the extra
computation involved in CUM2 reduces the variations. Note, also, that variations in the MVDR processors
are significantly larger than those of the cumulant-based counterparts. This agrees with the
previous remarks about the sensitivity of MVDR processing to experimental conditions in a high-SNR
environment.


Table 1: Results from 100 Monte-Carlo Runs for Experiment 1

                      White-Noise Gain (dB)
               SNR = 20 dB            SNR = 0 dB
Processor     Mean       Std.       Mean      Std.
MVDR1       -38.130     1.079      0.113     0.281
MVDR2         0.179     1.300      9.583     0.131
CUM1          9.954     0.015      9.058     0.359
CUM2          9.990     0.003      9.959     0.014

We performed the same experiment for the 0 dB SNR condition. Figure 10 illustrates the
beampattern responses and white-noise gains of the processors. Monte-Carlo results are also given in
Table 1. In this low-SNR condition, MVDR results are expected to improve since the mismatch
conditions for the desired signal will be masked by the presence of white noise of comparable power,
as explained in Section 3.3. The MVDR1 processor does not offer a significant gain due to the persistent
mismatch condition, but MVDR2 yields a near-optimum result, since the presence of higher-level noise
masks the mismatch due to the use of a sample-covariance matrix. The performance of the CUM1
processor is slightly below that of MVDR2 and exhibits more variations. This is due to the
inefficient use of the array data, since a high level of noise corrupts the cumulant estimates and
with CUM1 there are no precautions to combat these errors. As expected, CUM2 overcomes this
problem by using SVD. Results in Table 1 indicate that CUM2 achieves the best performance with
minimum variations.

Finally, to demonstrate the power of cumulant-based beamforming, we illustrate the received
signal and the output of the CUM2 processor for the SNR = 0 dB case in Fig. 11. It is clear that CUM2 is
capable of sufficient noise rejection for performing correct decisions.

Figure 12: Beamforming in the presence of spatially colored noise: (a) Spatial Power Spectral
Density of noise, (b) Beampattern of CUM2 processor. The optimum pattern is illustrated
in dashed lines for comparison purposes.

3.6.2  Experiment  2:  Spatially  Colored  Noise  and  Multipath  Propagation 

In  this  experiment,  we  investigate  the  performance  of  the  proposed  approach  in  t he  presence  of 
spatially  colored  noise.  YVe  employ  the  linear  ana,  ot  the  previous  experiment.  YVe  assume  that  the 
noise  field  is  created  by  a  set  of  point  sources  distributed  symmetrically  about  the  broadside  of  the 
linear  array.  As  suggested  in  [07j.  this  source  structure  is  typical  when  the  noise  field  is  spherically 
or  cvlindrically  isotropic.  In  this  case,  the  noise  covariance  matrix  is  symmetric -Toeplitz.  In  our 
experiment,  we  u  e  the  following  stiucture  for  the  covariance  matrix  of  undesired  components. 

    $R_n(i,j) = 0.8^{|i-j|}$    (42)
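
A covariance matrix of this form can be explored numerically. The following is a minimal sketch, not the authors' simulation code: it assumes a 10-element uniform linear array with half-wavelength spacing, takes the base of the exponential (0.8 here) as an adjustable parameter `rho`, and uses the conventional quadratic form $a(\theta)^H R\, a(\theta)$ as the spatial spectrum. The function names are illustrative only.

```python
import numpy as np

def toeplitz_noise_cov(m, rho):
    """Symmetric-Toeplitz covariance with entries rho**|i - j|."""
    idx = np.arange(m)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def spatial_spectrum(R, angles_deg, spacing=0.5):
    """Conventional spatial spectrum a(theta)^H R a(theta) / m.

    spacing is the element spacing in wavelengths (assumed 0.5 here).
    """
    m = R.shape[0]
    theta = np.deg2rad(np.asarray(angles_deg, dtype=float))
    # Steering vectors of a uniform linear array, one column per angle.
    phase = 2j * np.pi * spacing * np.outer(np.arange(m), np.sin(theta))
    A = np.exp(phase)
    # Quadratic form per angle; the result is real because R is symmetric.
    return np.real(np.einsum("ia,ij,ja->a", A.conj(), R, A)) / m

angles = np.linspace(-90.0, 90.0, 181)   # degrees from broadside
R = toeplitz_noise_cov(10, 0.8)          # rho = 0.8 is an assumed value
P = spatial_spectrum(R, angles)
peak_angle = angles[np.argmax(P)]        # angle of maximum noise power
```

For any positive `rho` below one, every term of the quadratic form is largest when the steering phases vanish, so the spectrum peaks at broadside.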

The spatial power spectrum of undesired components is illustrated in Fig. 12a. It is clear
that most of the noise leaks into the system from broadside. The desired signal illuminates the
array from broadside, with an SNR of 10 dB. To illustrate the optimum combining property of our
approach, we implanted an exact replica of the desired signal illuminating the array from 60°, where
Table 2: Results from 100 Monte-Carlo runs for Experiment 2

Processor    SNR_o (dB)
             Mean      Std
CUM1         23.601    0.017
CUM2         23.605    0.015

noise power is relatively less when compared to that from broadside. The beampattern of the CUM2
processor is given in Fig. 12b. For comparison purposes, we present the response of the optimum
beamformer based on true statistical information, as a dashed curve. The maximum-possible SNR
at the output is 24.080 dB for this scenario. It is clear that the response of CUM2 is almost identical
to that of the optimum beamformer: both processors emphasize the signal illuminating the array
from 60°, since the noise contribution is less in this region. We performed 100 Monte-Carlo runs
to test this scenario, and the results are presented in Table 2. It is clear that both cumulant-based
processors perform equally well. The reason for this phenomenon is the presence of the multipath
from 60° through a low noise background that virtually increases the effective SNR, which, in
turn, alleviates the effects of estimation errors. Note that the peak of the beampattern is slightly
shifted from 60°, in order to receive less interference. Similar behavior is observed in covariance-based
direction-of-arrival estimation in the presence of colored noise, resulting in biased estimates
of parameters.

3.6.3  Experiment  3:  Effects  of  Robustness  Constraint 

In this experiment, we illustrate the effects of the robustness constraint of Section 3.3.3 on a CUM1
processor in the presence of white noise. We employ the same array as in the previous experiments.
We employ CUM1, since this processor uses the data inefficiently, and requires a robust approach. In
our experiment, we consider the situation with SNR=0 dB. Figure 13 illustrates the beampatterns
of the CUM1 processor for several SNR values. It is clear from the results that, as the perturbation
increases, the patterns match better since the mismatch due to estimation errors in the steering
vector estimate is masked by the presence of a virtually increased level of noise. This method should
be used sparingly in the presence of jammers, because virtually increasing the noise level results in
diverting the capability of the array from nulling the directional interference.

3.6.4  Experiment  4:  Multiple  Interferers

In this experiment, we consider the problem of beamforming in a multipath environment in the
presence of multiple jammers. We employ the same array as in the previous experiments. The
signal of interest originates from a BPSK communication source, and it is expected from broadside;
however, due to multipath effects, multiple delayed and shifted replicas are received. There are two
jammers and one is subject to multipath as well. Table 3 summarizes the signal structure.

Note that there are 10 wavefronts illuminating the array and it is not possible to estimate their
DOA's with any existing high-resolution method; hence, signal copy algorithms [58] can not be
used even with perfect knowledge of the array manifold.
Table 3: Signal structure for Experiment 4
Figure 14: Beampatterns and array gains of processors: (a) MVDR with correct look direction,
(b) CUM2. The optimum pattern is illustrated in dashed lines for comparison purposes.
Table 4: Results from 100 Monte-Carlo runs for Experiment 4

Processor    SINR_o (dB)
             Mean       Std
MVDR         -28.424    4.405
CUM1           4.110    2.118
CUM2          10.290    0.740
C2            11.879    0.027

Experiment 2, it estimates the generalized steering vector of the desired signal and combines the
wavefronts to enhance SINR at the output. CUM2 puts a null on the jammer from -11°, destructively
combines the wavefronts from the first jammer by weight-phasing rather than null-steering, and
reinforces the wavefronts from the desired source.

Finally, we implement the C2 beamformer suggested in Section 3.3.2: we first estimate the
steering vector as done for CUM2, but then further project it into the subspace spanned by the
principal eigenvectors of the sample covariance matrix. We use the resultant vector as the estimate
of the desired signal steering vector, and construct an MVDR beamformer based on it. The
performance of the resultant processor is demonstrated in Table 4.

We  observe  that  by  combining  cumulants  with  covariance  information,  we  obtain  the  best 
results. 
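
The two covariance-side steps described above (projection onto the principal eigenvectors, then MVDR beamforming) can be sketched as follows. This is only an illustration under stated assumptions: the cumulant-based estimation step is not reproduced, so a hypothetical noisy estimate `a_hat` stands in for it; the scenario (one broadside BPSK-like source in white noise, 500 snapshots, one principal eigenvector) and all function names are invented for the example.

```python
import numpy as np

def project_onto_signal_subspace(a_hat, R_hat, n_principal):
    """Project a steering-vector estimate onto the span of the
    n_principal dominant eigenvectors of the sample covariance."""
    eigvals, eigvecs = np.linalg.eigh(R_hat)   # eigenvalues in ascending order
    Es = eigvecs[:, -n_principal:]             # principal eigenvectors
    return Es @ (Es.conj().T @ a_hat)

def mvdr_weights(a, R_hat):
    """MVDR weights w = R^{-1} a / (a^H R^{-1} a)."""
    Ri_a = np.linalg.solve(R_hat, a)
    return Ri_a / (a.conj() @ Ri_a)

rng = np.random.default_rng(0)
m, n_snap = 10, 500
a_true = np.ones(m, dtype=complex)             # broadside steering vector
s = rng.choice([-1.0, 1.0], size=n_snap)       # BPSK-like source samples
noise = (rng.standard_normal((m, n_snap))
         + 1j * rng.standard_normal((m, n_snap))) / np.sqrt(2)
X = np.outer(a_true, s) + noise                # array snapshots
R_hat = X @ X.conj().T / n_snap                # sample covariance matrix

# Hypothetical noisy estimate standing in for the cumulant-based step.
a_hat = a_true + 0.3 * (rng.standard_normal(m) + 1j * rng.standard_normal(m))
a_proj = project_onto_signal_subspace(a_hat, R_hat, n_principal=1)
w = mvdr_weights(a_proj, R_hat)                # beamformer built on a_proj
```

In this toy setting the projection pulls the estimate back toward the dominant eigenvector of the sample covariance, and the resulting weights keep unit gain along the projected vector.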

3.6.5  Experiment  5:  Adaptive  Processing 

In this section, we demonstrate the results from the adaptive version of the CUM1 approach as described
in Section 3.5. We employ the 10-element uniform linear array of previous experiments. The initial
pattern of the beamformer is designed to be isotropic, by letting $c(0) = [1, 0, \ldots, 0]^T$. The desired signal
illuminates the array from broadside with SNR=10 dB. A jammer with power equal to that of the
desired source is present at 30°. Note that there is no nonstationarity involved in this experiment:
our aim is to demonstrate the evolution of the beamforming process and indicate the data lengths
required for cumulant and covariance estimation. Tracking properties will be included in our future
work, including comparisons with adaptive versions of CUM2 and C2 processors.

Figure 15 illustrates the beampattern of the adaptive CUM1 processor as time evolves. After
100 snapshots, the beampattern is still close to isotropic. At 300 snapshots, the covariance matrix
estimate is improved, indicating the presence of the desired signal from broadside. At this time point,
the cumulant-based steering vector estimate has not matured, so it can not prevent the desired
signal from being cancelled. After 500 snapshots, cumulant estimates get better, and there is a
tendency to cancel the interference rather than the desired signal. Finally, after 700 snapshots the
processor removes the interference by null steering.


3.6.6  Experiment  6:  Effects  of  Data  Length 

In this section, we employ the linear array of Experiment 1, with the same noise conditions, and
vary the data length to observe the behavior of the beamformers CUM1, CUM2, MVDR1 and
MVDR2. Figure 16 demonstrates the variation of white noise gain of the processors with data
length, for 0 dB and 20 dB SNR levels. Each point on the plots is obtained by averaging the results
from 50 Monte-Carlo simulations.

From Fig. 16a it is clear that CUM2 outperforms all the processors, including MVDR2 which
utilizes the correct look direction, for all data lengths. Furthermore, small sample properties of
CUM2 are quite impressive, motivating further research for developing its adaptive version. Low
SNR masks the mismatch in MVDR2 due to the use of the sample covariance matrix; hence, as can be
seen from Fig. 16a, CUM1 is inferior to MVDR2.

Figures 16b and 16c indicate the effect of higher SNR on performance. CUM1 and CUM2
perform almost identically for all data lengths. Their gain is larger than 9 dB even for fewer than 50
snapshots. MVDR2 can not recover in this experiment since the mismatch results in severe signal
cancellation. We do not include the response of MVDR1, because its performance drifts around
-35 dB.

These results indicate that our approach has very promising small sample behavior that deserves
more research. This will be a topic of another paper.


3.7  Conclusions 

We have presented optimum beamforming algorithms for non-Gaussian signals, which are based
on fourth-order cumulants of the data received by the array. Our proposed methods do not make
any assumption about the sensor locations and characteristics, i.e., they are blind beamforming
methods. Cumulant-based estimation is employed to identify the steering vector of the signal
of interest and MVDR beamforming using this estimate is used to remove Gaussian interference
components. We have suggested several approaches to combat effects of estimation errors. We have
also implemented a recursive version of the method to enable real-time beamforming. Simulation
experiments demonstrate the performance of our approaches in a wide variety of situations. It is
important to emphasize that the proposed methods outperform an MVDR beamformer with an
exactly known look-direction.

In  our  future  work,  we  shall  address  the  problem  of  optimum  beamforming  in  the  presence 
of  multiple  non-Gaussian  interferers  and  design  of  adaptive  algorithms  with  better  convergence 
properties. 


4  Final  Comments 

In this paper, we summarized our recent research results on the applications of cumulants in speech
and array processing. The results are very promising, and encourage further study in these areas.

We acknowledge that especially in speech processing, cumulant applications are still in a very
premature state. Array processing, however, captured more attention, particularly after the excellent
work in [9]. On the other hand, array processing has many practical problems, such as unknown
sensor gain/phase factors, array shape calibration, and DOA estimation for coherent sources in colored
noise. It is our aim to develop cumulant-based solutions to those practical problems that still
lack reasonable solutions when only second-order statistics are employed.
Moments and Wavelets in
Signal Estimation


Abstract:  The  problem  of  generalized  nonparametric  function  estimation  has 
received  considerable  attention  over  the  last  two  decades.  Most  of  the  approaches  have 
assumed  smoothness  of  the  function  to  be  estimated  generally  in  the  form  of  continuity 
of  higher  order  derivatives  and/or  bounded  variation  and  have  used  convolution  kernels 
or  splines  as  the  estimation  devices.  Generally  focus  has  been  on  density  estimation  or 
nonparametric  regression.  The  spline  and  kernel-based  methods  may  be  inappropriate  if 
either smoothness assumptions are violated or if additional side conditions are present.
Wegman (1984) introduced a general framework for optimal nonparametric function
estimation which applies to a much wider class of problems than simply density
estimation  or  nonparametric  regression.  In  this  framework,  a  class  of  admissible 
estimators  is  regarded  as  a  compact,  convex  subset  of  a  Banach  function  space  and  a 
convex  objective  functional  is  to  be  optimized  over  this  set.  Recent  work  on  wavelets 
suggests  a  powerful  method  for  constructing  orthonormal  bases  to  span  the  set  of 
admissible  estimators.  Moreover,  older  work  on  frames  has  re-emerged  to  some  level  of 
prominence  because  of  the  work  on  wavelets.  The  optimal  estimates  can  be  computed 
as  weighted  linear  combinations  of  the  orthonormal  bases.  The  weight  coefficients  are 
computed  as  moments  of  the  basis  functions.  We  illustrate  these  methods  with  some 
numerical  examples. 

1.  Introduction. 

The  method  of  moments  is  a  time-honored  traditional  technique  in  statistical 
inference  while  wavelet  analysis  has  recently  burst  upon  the  mathematical  scene  to 
capture  the  enthusiasm  and  imagination  of  many  applied  mathematicians  and  engineers 
both  because  of  their  important  applications  in  signal  and  image  processing  and  other 
engineering  applications  and  also  because  of  the  inherent  elegance  of  the  techniques.  In 
this  paper  we  bring  these  tools  together  to  illustrate  their  application  to  transient  signal 
processing.  Wavelets  are  described  in  detail  in  a  number  of  locations.  Much  of  the 
fundamental  work  was  done  by  Daubechies  and  is  reported  in  Daubechies,  Grossmann 
and  Meyer  (1986)  and  Daubechies  (1988).  Heil  and  Walnut  (1989)  provide  a  survey 
from  a  mathematical  perspective  while  Rioul  and  Vetterli  (1991)  provide  a  survey  from 
a  more  engineering  perspective.  The  new  book  by  Chui  (1992)  is  an  excellent  integrated 
treatment,  which  I  believe  is  more  mathematically  sophisticated  than  the  author 
supposes.  In  spite  of  its  title  as  an  introduction,  it  requires  somewhat  more 
mathematical  depth  and  maturity  and  is  best  regarded  as  more  of  a  monograph. 

This  present  paper  describes  the  basic  wavelet  theory  in  the  context  of  the 
general statistical problem of nonparametric function estimation. It will be shown that
traditional  moment  based  techniques  have  an  interesting  and  useful  connection  to 
modern  nonparametric  functional  inference  for  signal  processing  via  wavelets.  Wegman 
(1984)  describes  a  basic  framework  for  optimal  nonparametric  function  estimation.  This 
framework  captures  the  optimal  estimation  of  a  wide  variety  of  practical  function 
estimation  problems  in  a  common  theoretical  construct.  Wegman  (1984),  however,  only 
discusses  the  existence  of  such  optimal  estimators.  In  the  present  paper,  we  are 
interested  in  combining  this  optimality  framework  with  more  general  wavelet 
algorithms  as  computational  devices  for  general  optimal  nonparametric  function 
estimation. A new application of optimal nonparametric function estimation is found in
Le  and  Wegman  (1991).  A  second  application  will  be  discussed  in  this  paper. 

In section 2, we discuss the optimal nonparametric function estimation
framework. In section 3, we turn to a discussion of the general function analytic
framework which leads to bases and frames. Section 4 introduces the notion of a
wavelet basis and demonstrates the connection with Fourier series and Parseval's
Theorem. In section 5 we turn to transient signal estimation, develop an optimization
criterion and illustrate the computation of a transient signal estimator.

2.  Optimal  Nonparametric  Function  Estimation.

Consider  a  general  function,  f(x),  to  be  estimated  based  on  some  sampled  data, 
say $x_1, x_2, \ldots, x_n$. This is, in fact, the most elementary estimation problem in statistical
inference. Often the function, $f$, in question is the probability distribution function or
the probability density function and most frequently the approach taken is to place the
function within a parametric family indexed by some parameter, say $\theta$. Rather than
estimate $f$ directly, the parameter $\theta$ is estimated with $f_\theta$ then being estimated by $\hat f = f_{\hat\theta}$.
Under  a  variety  of  circumstances,  it  is  much  more  desirable  to  take  a  nonparametric 
approach  so  as  to  avoid  problems  associated  with  misspecification  of  parametric  family. 
This  is  particularly  the  case  when  data  is  relatively  plentiful  and  the  information 
captured  by  the  parametric  model  is  not  needed  for  statistical  efficiency. 

Probability  density  estimation  and  nonparametric,  nonlinear  regression  are 
probably the two most widely studied nonparametric function estimation problems.
However, other problems of interest which immediately come to mind are spectral
density  estimation,  transfer  function  estimation,  impulse  response  function  estimation, 
all  in  the  time  series  setting,  and  failure  rate  function  estimation  and  survival  function 
estimation  in  the  reliability/biometry  setting.  While  it  may  be  the  case  that  we  simply 
may  want  an  unconstrained  estimate  of  the  function,  it  is  more  often  the  case  that  we 
wish  to  impose  one  or  more  constraints,  for  example,  positivity,  smoothness,  isotonicity, 
convexity,  transience  and  fixed  discontinuities  to  name  a  few  appropriate  constraints. 
By  far,  the  most  common  assumption  is  smoothness  and  frequently  the  estimation  is  via 
a kernel or convolution smoother. We would like to formulate an optimal
nonparametric  framework. 

We formulate the optimization problem as follows. Let $\mathcal{H}$ be a Hilbert space of
functions over $\mathbb{R}$, the real numbers (or $\mathbb{C}$, the complex numbers). For purposes of the
present paper, we assume $\mathbb{R}$ rather than $\mathbb{C}$ unless otherwise specified. The techniques we
outline here are not limited to a discussion of $L_2(\mathbb{R})$ although quite often we do take $\mathcal{H}$
to be $L_2$. In this case, we take

    $\langle f, g \rangle = \int f(x)\, g(x)\, d\mu(x)$,

where $\mu$ is Lebesgue measure. We emphasize that this is not absolutely required. As
usual $\| f \| = \sqrt{\langle f, f \rangle}$. A functional $\mathcal{L}: \mathcal{H} \to \mathbb{R}$ is linear if

    $\mathcal{L}(\alpha f + \beta g) = \alpha \mathcal{L}(f) + \beta \mathcal{L}(g)$, for every $f, g \in \mathcal{H}$ and $\alpha, \beta \in \mathbb{R}$.

$\mathcal{L}$ is convex on $S \subset \mathcal{H}$ if

    $\mathcal{L}(tf + (1 - t)g) \le t\mathcal{L}(f) + (1 - t)\mathcal{L}(g)$, for every $f, g \in S$ with $0 \le t \le 1$.

$\mathcal{L}$ is concave if the inequality is reversed. $\mathcal{L}$ is strictly convex (concave) on $S$ if the
inequality is strict. $\mathcal{L}$ is uniformly convex on $S$ if

    $t\mathcal{L}(f) + (1 - t)\mathcal{L}(g) - \mathcal{L}(tf + (1 - t)g) \ge c\, t(1 - t)\, \| f - g \|^2$

for every $f, g \in S$ and $0 \le t \le 1$.

We wish to use $\mathcal{L}$ as the general objective functional in our optimization
framework. For example, if we are concerned with likelihood, we may consider the log
likelihood,

    $\mathcal{L}(f) = \sum_{i=1}^{n} \log f(x_i)$,  where the $x_i$ are a random sample from $f$.

If we have censored samples we may wish to consider

    $\mathcal{L}(g) = \sum_{i=1}^{n} \delta_i \log g(x_i) + \sum_{i=1}^{n} (1 - \delta_i) \log \bar{G}(x_i)$,

with $x_i$ again a random sample, $\delta_i$ a censoring random variable, $\bar{G} = 1 - G$, and
$G(x) = \int_{-\infty}^{x} g(u)\, du$. This is the censored log likelihood. Another example is the
penalized least squares. In this case

    $\mathcal{L}(g) = \sum_{i=1}^{n} (y_i - g(x_i))^2 + \lambda \int_a^b (Lg(u))^2\, du$.

Here $L$ is a differential operator and the solution of this optimization problem over
appropriate spaces is called a penalized smoothing L-spline. If $L = D^2$ then the solution
is the familiar cubic spline.
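
As a rough numerical illustration of the penalized least squares criterion, the sketch below replaces the integral penalty with a sum of squared second differences on an equally spaced grid. This is a discrete stand-in for $L = D^2$, not the cubic smoothing spline itself, and the function name, test signal and value of $\lambda$ are assumptions made for the example.

```python
import numpy as np

def penalized_least_squares(y, lam):
    """Minimize sum_i (y_i - g_i)^2 + lam * sum_i (second difference of g)_i^2
    on an equally spaced grid, by solving the normal equations."""
    y = np.asarray(y, dtype=float)
    n = y.size
    # D2 is the (n-2) x n second-difference matrix, a discrete stand-in for D^2.
    D2 = np.diff(np.eye(n), n=2, axis=0)
    return np.linalg.solve(np.eye(n) + lam * D2.T @ D2, y)

x = np.linspace(0.0, 1.0, 50)
rng = np.random.default_rng(1)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)  # noisy samples
g_smooth = penalized_least_squares(y, lam=10.0)
```

Setting `lam` to zero returns the data unchanged, while straight-line data pass through untouched for any `lam` because their second differences vanish.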

The basic idea is to construct $S \subset \mathcal{H}$ where $S$ is the collection of functions, $g$,
which satisfy our desired constraints such as smoothness or isotonicity. We wish to
optimize $\mathcal{L}(g)$ over $S$. The optimized estimator will be an element of $S$ and hence will
inherit whatever properties we choose for $S$. The estimator will optimize $\mathcal{L}(g)$ and
hence  will  be  chosen  according  to  whatever  optimization  criterion  appeals  to  the 
investigator.  In  this  sense  we  can  construct  designer  estimators,  i.e.  estimators  that  are 
designed  by  the  investigator  to  suit  the  specifics  of  the  problem  at  hand. 

Of  course,  in  a  wide  variety  of  rather  disparate  contexts,  many  of  these 
estimators  are  already  known.  However,  they  may  be  proven  to  exist  in  a  general 
framework  according  to  the  following  theorem. 

Theorem  2.1: 

Consider  the  following  optimization  problem: 

Minimize (maximize) $\mathcal{L}(f)$ subject to $f \in S \subset \mathcal{H}$.

Then

a.  If $\mathcal{H}$ is finite dimensional, $\mathcal{L}$ is continuous and convex (concave) and $S$ is closed
and bounded, then there exists at least one solution.

b.  If $\mathcal{H}$ is infinite dimensional, $\mathcal{L}$ is continuous and convex (concave) and $S$ is
closed, bounded and convex, then there exists at least one solution.

c.  If $\mathcal{L}$ in a. or b. is strictly convex (concave), the solution is unique.

d.  If $\mathcal{H}$ is infinite dimensional, $\mathcal{L}$ is continuous and uniformly convex (concave)
and $S$ is closed and convex, then there exists a unique solution.

Proof: A full proof is given in Wegman (1984). For completeness, we outline the basic
elements here. a. For the finite dimensional case, $S$ closed and bounded implies that $S$
is compact. Choose $f_n \in S$ such that $\mathcal{L}(f_n)$ converges to $\inf\{\mathcal{L}(f): f \in S\}$. Because of
compactness, there is a convergent subsequence $f_{n_k}$ having a limit, say $f_*$. By
continuity of $\mathcal{L}$,

    $\mathcal{L}(f_*) = \lim_{k \to \infty} \mathcal{L}(f_{n_k}) = \inf\{\mathcal{L}(f): f \in S\}$.

$f_*$ is the required optimizer. For part b., we have the same basic idea except that $S$
closed, bounded and convex implies that $S$ is weakly compact. We use the weak
continuity of $\mathcal{L}$. Uniqueness follows by supposing that $f_*$ and $f_{**}$ are both minimizers.
Then

    $\mathcal{L}(t f_* + (1 - t) f_{**}) < t\mathcal{L}(f_*) + (1 - t)\mathcal{L}(f_{**}) = \inf\{\mathcal{L}(f): f \in S\}$.

This implies that neither $f_*$ nor $f_{**}$ is a minimizer, which is a contradiction.  □

This theorem gives us a unified framework for the construction of optimal
nonparametric  function  estimators.  It  does  not,  however,  give  us  a  definitive  method 
for  construction  of  nonparametric  function  estimators.  We  give  a  constructive 
framework  in  the  next  several  sections.  In  closing  this  section  we  refer  the  reader  to 
Wegman  (1984)  for  the  complete  proof  of  Theorem  2.1  and  many  more  examples  of  the 
use  of  this  result. 

3.  Bases  and  Subspaces. 

In this section, we discuss the basic theory of spanning bases and their
application to function estimation. Consider $f, g \in \mathcal{H}$. $f$ is said to be orthogonal to $g$,
written $f \perp g$, if $\langle f, g \rangle = 0$. An element $f$ is normal if $\| f \| = 1$. A family of elements,
say $\{e_\lambda: \lambda \in \Lambda\}$, is orthonormal if each element is normal and if for any pair $e_1$, $e_2$ in the
family, $e_1 \perp e_2$. A family $\{e_\lambda: \lambda \in \Lambda\}$ is complete in $S \subset \mathcal{H}$ if the only element in $S$ which
is orthogonal to every $e_\lambda$, $\lambda \in \Lambda$, is 0. A basis or base of $S$ is a complete orthonormal
family in $S$. A Hilbert space has a countable basis if and only if it is separable, i.e. if
and only if it has a countable dense subset. Ordinary $L_p$ spaces are separable. We are
now in a position to state the basic result characterizing bases of Hilbert spaces or
subspaces. We write $\mathrm{span}(\{e_\lambda\})$ to be the minimal subspace containing $\{e_\lambda\}$. This is
the space generated by the elements $\{e_\lambda\}$.

Theorem  3.1: 

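Let $\mathcal{H}$ be a separable Hilbert space. If $\{e_k\}_{k=1}^{\infty}$ is an orthonormal family in $\mathcal{H}$,
then the following are equivalent.

a.  $\{e_k\}_{k=1}^{\infty}$ is a basis for $\mathcal{H}$.

b.  If $f \in \mathcal{H}$ and $f \perp e_k$ for every $k$, then $f = 0$.

c.  If $f \in \mathcal{H}$, then $f = \sum_{k=1}^{\infty} \langle f, e_k \rangle e_k$. (orthogonal series expansion)

d.  If $f, g \in \mathcal{H}$, then $\langle f, g \rangle = \sum_{k=1}^{\infty} \langle f, e_k \rangle \langle g, e_k \rangle$.

e.  If $f \in \mathcal{H}$, $\| f \|^2 = \sum_{k=1}^{\infty} | \langle f, e_k \rangle |^2$. (Parseval's Theorem)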

Proof:

a ⇒ b: Trivial by definition.

b ⇒ c: We claim $\mathcal{H} = \mathrm{span}(\{e_k\})$. If not, there is $f \ne 0$, $f \in \mathcal{H}$, such that
$f \notin \mathrm{span}(\{e_k\})$. This implies that $f \perp e_k$ for every $k$. But $f \perp e_k$ for every $k$ and $f \ne 0$ is a
contradiction to the $\{e_k\}$ being a basis. Let $\mathcal{H}_k = \mathrm{span}(e_k)$. Then $\mathcal{H} = \mathrm{span}(\cup_k \mathcal{H}_k) =
\sum_k \oplus\, \mathcal{H}_k$. This implies that for $f \in \mathcal{H}$,

(3.1)    $f = \sum_{k=1}^{\infty} c_k e_k$.

Substituting (3.1) in the expression for the inner product yields

    $\langle f, e_j \rangle = \langle \sum_{k=1}^{\infty} c_k e_k, e_j \rangle = \sum_{k=1}^{\infty} c_k \langle e_k, e_j \rangle$.

By the orthonormal property, $\langle e_k, e_j \rangle = 1$ if $k = j$ and $= 0$ otherwise. It follows that
$\langle f, e_j \rangle = c_j$. Thus

(3.2)    $f = \sum_{k=1}^{\infty} \langle f, e_k \rangle e_k$.

c ⇒ d: $\langle f, g \rangle = \langle f, \sum_{k=1}^{\infty} \langle g, e_k \rangle e_k \rangle = \sum_{k=1}^{\infty} \langle g, e_k \rangle \langle f, e_k \rangle$.

d ⇒ e: Let $f = g$ in part d.

e ⇒ a: If $f \in \mathcal{H}$ and $f \perp e_k$ for every $k$, then $\langle f, e_k \rangle = 0$ for every $k$. This in
turn implies that $\| f \| = 0$. Thus $f = 0$. This finally implies $\{e_k\}_{k=1}^{\infty}$ is a basis.  □
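
Parts c, d and e of Theorem 3.1 are easy to check numerically in a finite-dimensional space. The sketch below is only an illustration: it works in $\mathbb{R}^8$ with the ordinary dot product as inner product, and the orthonormal basis is an arbitrary one obtained from a QR factorization of a random matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
# Columns of Q form an orthonormal basis e_1, ..., e_n of R^n.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
f = rng.standard_normal(n)
g = rng.standard_normal(n)

coef_f = Q.T @ f                 # c_k = <f, e_k>
coef_g = Q.T @ g
f_rebuilt = Q @ coef_f           # part c: f = sum_k <f, e_k> e_k
inner_fg = coef_f @ coef_g       # part d: sum_k <f, e_k> <g, e_k>
norm_sq = np.sum(coef_f ** 2)    # part e: sum_k |<f, e_k>|^2
```

Here `f_rebuilt` reproduces `f`, `inner_fg` agrees with the dot product of `f` and `g`, and `norm_sq` agrees with the squared norm of `f`, up to rounding.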

Thus, given any basis $\{e_k\}$, we can exactly write $f = \sum_{k=1}^{\infty} c_k e_k$, and we can
estimate $f$ by $\sum_k \hat{c}_k e_k$, where $\hat{c}_k$ is an estimate of $c_k$. Thus a computational algorithm
for the optimal nonparametric function estimator can be based on this result from Theorem 3.1.c.
However, this does not yet take into account the “design” set, $S$. In order to more carefully study the
structure of $S$ we consider the following result. In the following discussion let $S \subset \mathcal{H}$.
Then define $S^{\perp} = \{f \in \mathcal{H}: f \perp S\}$.

Theorem  3.2: 

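If $S \subset \mathcal{H}$ is a subset of $\mathcal{H}$, then

a.  $S^{\perp}$ is a subspace of $\mathcal{H}$ and $S \cap S^{\perp} \subset \{0\}$

b.  $S \subset S^{\perp\perp} = \mathrm{span}(S)$

c.  $S$ is a subspace if and only if $S = S^{\perp\perp}$.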

Proof: $S^{\perp}$ is a linear manifold. To see this, if $f_1, f_2 \in S^{\perp}$, then for every $g \in S$,
$\langle a_1 f_1 + a_2 f_2, g \rangle = a_1 \langle f_1, g \rangle + a_2 \langle f_2, g \rangle = a_1 \cdot 0 + a_2 \cdot 0 = 0$. Thus $a_1 f_1 + a_2 f_2 \in S^{\perp}$.
This implies $S^{\perp}$ is a linear manifold, which is sufficient to show that $S^{\perp}$ is a subspace
provided we can show $S^{\perp}$ is closed. To see this, if $f \in \mathrm{closure}(S^{\perp})$, then there exists
$\{f_n\} \subset S^{\perp}$ such that $f = \lim f_n$ and for every $g \in S$, $\langle f_n, g \rangle = 0$. But $\langle f, g \rangle =
\lim \langle f_n, g \rangle = \lim 0 = 0$. This implies $f \perp S$, which in turn implies $f \in S^{\perp}$. Part b
follows from part a by replacing $S$ by $S^{\perp}$. Part c is a straightforward application of the
two previous parts.  □

Suppose now that we have a basis for $\mathcal{H}$, call it $\{e_k\}_{k=1}^{\infty}$. This basis obviously
also spans the subset $S$ of $\mathcal{H}$ and hence any of our “designer” functions in $S$ can be written
in terms of the basis, $\{e_k\}_{k=1}^{\infty}$. The unnecessary basis elements will simply have
coefficients of 0. In a sense, however, this basis is too rich and in a noisy estimation
setting superfluous basis elements will only contribute to estimating noise. As part of
our “designer” set, $S$, philosophy, we would like to have a minimal basis set for $S$.
Theorem 3.2 gives us a test for this condition. Consider a basis $\{e_k\}_{k=1}^{\infty}$ for $\mathcal{H}$. Form
$B_S$, which is to be a basis for $S$. We define $B_S$ by the following routine. If there is a
$g \in S$ such that $\langle g, e_k \rangle \ne 0$, then let $e_k \in B_S$. If on the other hand there is a
$g \in S^{\perp}$ such that $\langle g, e_k \rangle \ne 0$, then let $e_k \in B_{S^{\perp}}$. Unfortunately, it may not be that
$B_S \cap B_{S^{\perp}} = \emptyset$. But this algorithm yields $\{e_k\} = B_S \cup B_{S^{\perp}}$. Moreover $S \subset \mathrm{span}(B_S)$.
Thus we may be able to eliminate unnecessary basis elements. We may also be able to
re-normalize the basis elements using a Gram-Schmidt orthogonalization procedure to
make $B_S \perp B_{S^{\perp}}$. Usually if we know the properties of the set, $S$, we desire and the
nature of the basis set $\{e_k\}$, it will be straightforward to construct a test function, $g$,
with which to construct the basis set, $B_S$. If $S$ is a subspace, then $S = \mathrm{span}(B_S)$. In any
case we can carry out our estimation by

(3.3)    $\hat{f} = \sum_{e_k \in B_S} \hat{c}_k e_k$.
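
The selection routine for $B_S$ can be mimicked in a finite-dimensional toy setting. In the sketch below, which is an illustration and not part of the paper, the Hilbert space is $\mathbb{R}^6$ with the standard basis, $S$ is the span of two hypothetical test functions (the columns of `G`), and a basis element is retained whenever some test function has a nonzero inner product with it.

```python
import numpy as np

def select_basis(E, G, tol=1e-12):
    """Return indices k for which some test function g (a column of G)
    has <g, e_k> != 0, where e_k is the k-th column of E."""
    inner = E.T @ G                       # inner[k, j] = <e_k, g_j>
    return np.flatnonzero(np.any(np.abs(inner) > tol, axis=1))

n = 6
E = np.eye(n)                             # basis {e_k} of R^n
# Hypothetical test functions spanning S: supported on coordinates 0, 1, 3 only.
G = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [0.0, 0.0],
              [0.0, 3.0],
              [0.0, 0.0],
              [0.0, 0.0]])
B_S = select_basis(E, G)                  # indices of retained basis elements

# Any g in S is reproduced by the retained elements alone.
g = G @ np.array([0.5, -1.5])
g_hat = E[:, B_S] @ (E[:, B_S].T @ g)
```

Three of the six basis elements survive, and an arbitrary element of $S$ is recovered exactly from them, which is the finite-dimensional analogue of $S \subset \mathrm{span}(B_S)$.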

In a completely noiseless setting (3.1) is really an equality in norm, i.e.
$\| f - \sum_k c_k e_k \| = 0$. If $\mathcal{H}$ is $L_2(\mu)$ with $\mu$ Lebesgue measure, then (3.1) is really

(3.4)    $f = \sum_k c_k e_k$ almost everywhere $\mu$, with $c_k = \langle f, e_k \rangle$.

This choice of $c_k$ is a minimum norm choice. However, in a noisy setting, i.e. where we
do not know $f$ exactly, we cannot compute $c_k$ directly. However, we may be able to
estimate $c_k$ by standard inference techniques.
Example 3.1. Norm Estimate. The minimum norm estimate of $c_k$ is the choice which
minimizes $\| f - \sum_k c_k e_k \|$, i.e. $c_k = \langle f, e_k \rangle$. In the $L_2$ context,

    $\langle f, e_k \rangle = \int_{\mathbb{R}} f(x)\, e_k(x)\, d\mu(x)$.

If $f$ is a probability density function, then $\langle f, e_k \rangle = E[e_k]$, which can simply be
estimated by $n^{-1} \sum_{j=1}^{n} e_k(x_j)$, where $x_j$, $j = 1, \ldots, n$, is the sample of observations. We
note that the major approach to estimating the weighting coefficients is via a traditional
method of moments.
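
The moment estimate of Example 3.1 translates directly into an orthogonal series density estimator. The following is a minimal sketch under assumptions not made in the text: the data live on $[0, 1]$, the basis is the cosine system $e_0 = 1$, $e_k(x) = \sqrt{2}\cos(k\pi x)$, the series is truncated at eight terms, and the sample is simulated from a Beta distribution purely for illustration.

```python
import numpy as np

def cosine_basis(k, x):
    """Orthonormal cosine basis on [0, 1]: e_0 = 1, e_k = sqrt(2) cos(k pi x)."""
    x = np.asarray(x, dtype=float)
    return np.ones_like(x) if k == 0 else np.sqrt(2.0) * np.cos(k * np.pi * x)

def moment_coefficients(sample, n_terms):
    """c_k estimated by the sample moment n^{-1} sum_j e_k(x_j)."""
    return np.array([cosine_basis(k, sample).mean() for k in range(n_terms)])

def series_density(coefs, x):
    """Evaluate the truncated series sum_k c_k e_k(x)."""
    return sum(c * cosine_basis(k, x) for k, c in enumerate(coefs))

rng = np.random.default_rng(3)
sample = rng.beta(2.0, 5.0, size=2000)    # hypothetical data on [0, 1]
coefs = moment_coefficients(sample, n_terms=8)
grid = np.linspace(0.0, 1.0, 1001)
f_hat = series_density(coefs, grid)       # estimated density on the grid
```

Because $e_0 = 1$ and the remaining basis functions integrate to zero over $[0, 1]$, the estimate integrates to one for any sample, although a truncated series of this kind is not guaranteed to be nonnegative.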

Example 3.2. General Form of Estimate. In the general context with optimization
functional $\mathcal{L}$ we have

(3.5)    $\mathcal{L}(f) = \mathcal{L}\left(\sum_k c_k e_k\right)$.

Since (3.5) is a function of a countable number of variables, $\{c_k\}$, we can find the
normal equations and, with the appropriate choice of basis, find a solution. For this we
will typically assume $\mathcal{L}$ is twice differentiable with respect to all $c_k$. A wide variety of
bases have been studied. These include Laguerre polynomials, Hermite polynomials and
other orthonormal systems. Perhaps the most well-known orthonormal system is the
system of fundamental sinusoids which span $L_2(0, 2\pi)$. One might reasonably guess
that wavelets form another orthogonal system. We discuss the connection in the next
section.
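
As a worked instance of the normal equations, consider the special case (chosen here for illustration, with real coefficients and an orthonormal basis) in which the objective is the squared norm of Example 3.1. Expanding the norm and differentiating with respect to a single coefficient $c_j$ gives:

```latex
\begin{align*}
\mathcal{L}(c_1, c_2, \ldots)
  &= \Big\| f - \sum_k c_k e_k \Big\|^2
   = \| f \|^2 - 2 \sum_k c_k \langle f, e_k \rangle + \sum_k c_k^2, \\
\frac{\partial \mathcal{L}}{\partial c_j}
  &= -2 \langle f, e_j \rangle + 2 c_j = 0
  \quad\Longrightarrow\quad c_j = \langle f, e_j \rangle, \\
\frac{\partial^2 \mathcal{L}}{\partial c_j^2} &= 2 > 0 .
\end{align*}
```

The normal equations decouple because the basis is orthonormal, and their solution is the minimum norm choice $c_j = \langle f, e_j \rangle$ of Example 3.1.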

4.  Fourier  Analysis  and  Wavelets. 

4.1  Bases for L2(0, 2π).

Let us consider the set of square-integrable functions on (0, 2π), which we denote by L2(0, 2π). L2(0, 2π) is a Hilbert space and a traditional choice of an orthonormal basis for this space has been $e_k(x) = e^{ikx}$, the complex sinusoids. Thus any f in L2(0, 2π) has the Fourier representation by Theorem 3.1.c

$f(x) = \sum_{k=-\infty}^{\infty} c_k\, e^{ikx}$

where the constants $c_k$ are the Fourier coefficients defined by

$c_k = \frac{1}{2\pi} \int_0^{2\pi} f(x)\, e^{-ikx}\, dx.$

This pair of equations represents the discrete Fourier transform and the inverse Fourier transform and is the foundation of harmonic analysis. An interesting feature of the complex sinusoids as a basis for L2(0, 2π) is that $e_k(x) = e^{ikx}$ can be generated from the superpositions of dilations of a single function, $e(x) = e^{ix}$. By this we mean that

$e_k(x) = e(kx), \quad k = \ldots, -1, 0, 1, \ldots$

These are integral dilations in the sense that $k \in J$, the integers. The concept of dilations of a fixed generating function is central to the formation of wavelet bases as we shall see shortly.

A well-known consequence of Theorem 3.1.e for the complex sinusoid basis is the Parseval Theorem. For this basis, we have

Theorem 4.1: (Parseval's Theorem):

(4.1)  $\|f\|^2 = \frac{1}{2\pi} \int_0^{2\pi} |f(x)|^2\, dx = \sum_{k=-\infty}^{\infty} |c_k|^2$

Equation (4.1) is known as Parseval's Theorem in harmonic analysis and states that the square norm in the frequency domain is equal to the square norm in the time domain. The factor $1/(2\pi)$ matches the normalization of the Fourier coefficients $c_k$.
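
The identity can be checked numerically on a simple trigonometric polynomial. The sketch below assumes the coefficients are normalized as $c_k = \frac{1}{2\pi}\int_0^{2\pi} f(x) e^{-ikx} dx$ and that the squared norm carries the same factor $1/(2\pi)$; the test function and the helper names are hypothetical.

```python
import cmath
import math

def fourier_coefficient(f, k, num_points=256):
    """c_k = (1/2pi) * integral over (0, 2pi) of f(x) exp(-ikx) dx, by a Riemann sum."""
    h = 2.0 * math.pi / num_points
    total = sum(f(j * h) * cmath.exp(-1j * k * j * h) for j in range(num_points))
    return total / num_points

def normalized_energy(f, num_points=256):
    """(1/2pi) * integral over (0, 2pi) of |f(x)|^2 dx, by a Riemann sum."""
    h = 2.0 * math.pi / num_points
    return sum(abs(f(j * h)) ** 2 for j in range(num_points)) / num_points

def f(x):
    # Made-up test signal with only a few non-zero Fourier coefficients.
    return 2.0 + 3.0 * math.cos(x) + math.sin(2.0 * x)

coeffs = {k: fourier_coefficient(f, k) for k in range(-4, 5)}
time_side = normalized_energy(f)
frequency_side = sum(abs(c) ** 2 for c in coeffs.values())
```

Here the non-zero coefficients are $c_0 = 2$, $c_{\pm 1} = 3/2$ and $c_{\pm 2} = \mp i/2$, so both sides of the identity equal 9.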

While the space L2(0, 2π) is an extremely useful one, for general problems in nonparametric function estimation we are much more interested in L2(R). We can think of L2(0, 2π) either as functions on the finite support (0, 2π) or as periodic functions on R. In the latter case it is clear that the infinitely periodic functions of L2(0, 2π) and the square-integrable functions of L2(R) are very different: a function $f(x) \in L_2(R)$ must, loosely speaking, decay to 0 as $x \to \pm\infty$. The generating function $e(x) = e^{ix}$ clearly does not have that behavior and is inappropriate as a basis generating function for L2(R). What is needed is a generating function, e(x), which also has the property that $e(x) \to 0$ as $x \to \pm\infty$. Thus we want to generate a basis from a function which will decay to 0 relatively rapidly, i.e. we want little waves or wavelets.

4.2  Wavelet Bases.


Let us begin by considering a generating function $\psi$ which we will think of as our mother wavelet or basic wavelet. The idea is that, just as with the sinusoids, we wish to consider a superposition of dilations of the basic waveform $\psi$. For technical convergence reasons which we shall explain later we wish to consider dyadic dilations rather than simply integral dilations. Thus for the first pass, we are inclined to consider $\psi_j(x) = 2^{j/2}\psi(2^j x)$. Unfortunately, because of the decay of $\psi$ to 0 as $x \to \pm\infty$, the elements $\{\psi_j\}$ are not sufficient to be a basis for L2(R). We accommodate this by adding translates to get the doubly indexed functions $\psi_{j,k}(x) = 2^{j/2}\psi(2^j x - k)$. We choose $\psi$ such that

$\int_R \frac{|\hat\psi(\omega)|^2}{|\omega|}\, d\omega$ exists.

Here $\hat\psi$ is the Fourier transform of $\psi$. Under certain choices of $\psi$, $\{\psi_{j,k}\}$ forms a doubly indexed orthonormal basis for L2(R) (and in fact also for Sobolev spaces of higher order). As we shall see in the next section, a wavelet basis, due to the dilation-translation nature of its basis elements, admits an interpretation as a simultaneous time-frequency decomposition of f. Moreover, using wavelets, fewer basis elements are required for fitting sharp changes or discontinuities. This implies faster convergence in "non-smooth" situations by the introduction of "localized" basis elements.

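Example 3.1 Continued: Notice that

$c_{j,k} = \langle f, \psi_{j,k} \rangle = \int 2^{j/2}\, \psi(2^j x - k)\, f(x)\, dx.$

In the density estimation case

$c_{j,k} = E[\,2^{j/2}\, \psi(2^j X - k)\,].$

Thus a natural estimator is

$\hat c_{j,k} = \frac{1}{n} \sum_{i=1}^{n} 2^{j/2}\, \psi(2^j x_i - k),$

where $x_i$, i = 1,...,n is the set of observations. Again we are simply using a method of moments estimator.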
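
A minimal sketch of this estimator follows. It assumes the Haar function as the mother wavelet purely for concreteness (the text has not yet constructed a particular $\psi$), and the sample values and function names are made up for illustration.

```python
import math

def haar(u):
    """Haar mother wavelet: +1 on [0, 1/2), -1 on [1/2, 1), 0 elsewhere."""
    if 0.0 <= u < 0.5:
        return 1.0
    if 0.5 <= u < 1.0:
        return -1.0
    return 0.0

def wavelet_coefficient_estimate(sample, j, k, psi=haar):
    """Estimate c_{j,k} by (1/n) * sum_i 2^(j/2) * psi(2^j x_i - k)."""
    scale = 2.0 ** (j / 2.0)
    return sum(scale * psi((2.0 ** j) * x - k) for x in sample) / len(sample)

sample = [0.1, 0.2, 0.6, 0.9]  # hypothetical observations on [0, 1)
c00 = wavelet_coefficient_estimate(sample, 0, 0)
c10 = wavelet_coefficient_estimate(sample, 1, 0)
c11 = wavelet_coefficient_estimate(sample, 1, 1)
```

Each estimate is a sample mean of the dilated and translated wavelet evaluated at the observations, so the cost is linear in n for every pair (j, k).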

Notice that we can construct a Parseval's Theorem for Wavelets.

Theorem 4.2: (Parseval's Theorem for Wavelets)

(4.2)  $\|f\|^2 = \int_{-\infty}^{\infty} |f(x)|^2\, dx = \sum_{j=-\infty}^{\infty} \sum_{k=-\infty}^{\infty} |c_{j,k}|^2 = \sum_{k=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} |c_{j,k}|^2$

At this stage we are left with the problem of constructing an appropriate mother wavelet, $\psi$, suitable for constructing the basis. To do this we turn to the device of multiresolution analysis.


4.3  Multiresolution  Analysis. 

To understand multiresolution analysis let us first consider the construction of the space $W_j = \mathrm{span}\{\psi_{j,k}: k \in J\}$. That is, we fix the dilation and consider the space generated by all possible translates. We may write L2(R) as a direct sum of the $W_j$,

$L_2(R) = \sum_{j \in J} W_j,$

so that any function $f \in L_2(R)$ may be written as

$f(x) = \cdots + d_{-1}(x) + d_0(x) + d_1(x) + \cdots$

where $d_j \in W_j$. If $\psi$ is an orthogonal wavelet, then $W_j \perp W_k$, $k \ne j$. We shall assume the unknown $\psi$ to be an orthogonal wavelet in what follows. Notice that as j increases, the basic wavelet form $\psi(2^j x - k)$ contracts, representing higher "frequencies." For each j we may consider the direct sum $V_j$ given by:

$V_j = \cdots + W_{j-2} + W_{j-1} = \sum_{m=-\infty}^{j-1} W_m.$

The $V_j$ are closed subspaces and represent spaces of functions with all "frequencies" at or below a given level of resolution. The set of spaces $\{V_j\}$ has the following properties:

1)  They are nested in the sense that $V_j \subset V_{j+1}$, $j \in J$.

2)  Closure$\left(\bigcup_{j \in J} V_j\right) = L_2(R)$.

3)  $\bigcap_{j \in J} V_j = \{0\}$.

4)  $V_{j+1} = V_j + W_j$.

5)  $f(x) \in V_j$ if and only if $f(2x) \in V_{j+1}$, $j \in J$.

1), 4) and 5) follow directly from the definition of $V_j$. 2) is a straightforward consequence of the fact that $\sum_{j \in J} W_j = L_2(R)$. 3) follows because of the orthogonality property.

Any $f \in L_2(R)$ can be projected into $V_j$. As we have seen, with j increasing the "frequency" of the wavelet increases, which can be interpreted as higher resolution. Thus the projection, $P_j f$, of f into $V_j$ is an increasingly higher resolution approximation to f as $j \to \infty$. Conversely, as $j \to -\infty$, $P_j f$ is an increasingly blurred (smoothed) approximation to f. We shall take $V_0$ as the reference subspace. Suppose now that we can find a function $\phi$ and that we can define $\phi_{j,k}(x) = 2^{j/2}\phi(2^j x - k)$ such that

$V_0 = \mathrm{span}\{\phi_{0,k}: k \in J\}.$

Then by property 5), $V_j = \mathrm{span}\{\phi_{j,k}: k \in J\}$. While we began our discussion with the notion of wavelets and have seen some of the consequences, we could have actually begun the discussion with the function $\phi$.

Definition. A function $\phi$ generates a multiresolution analysis if it generates a nested sequence of spaces having properties 1), 2), 3) and 5) such that $\{\phi_{0,k}: k \in J\}$ forms a basis for $V_0$. If so, then $\phi$ is called the scaling function.

For the final discussion of this section, let us consider a multiresolution analysis in which $\{V_j\}$ are generated by a scaling function $\phi \in L_2(R)$ and $\{W_j\}$ are generated by a mother wavelet function $\psi \in L_2(R)$. Any function $f \in L_2(R)$ can be approximated as closely as desired by $f_m \in V_m$ for some sufficiently large $m \in J$. Notice $f_m = f_{m-1} + d_{m-1}$ where $f_{m-1} \in V_{m-1}$ and $d_{m-1} \in W_{m-1}$. This process can be recursively applied, say $l$ times, until we have $f \approx f_m = d_{m-1} + d_{m-2} + \cdots + d_{m-l} + f_{m-l}$. Notice that $f_{m-l}$ is a highly smoothed version of the function. Indeed, this suggests that a statistical procedure might be to form a highly smoothed (even overly smoothed) approximation to a function to be estimated. The sequence $d_{m-l}$ through $d_{m-1}$ forms the higher resolution wavelet approximations. Many of the wavelet coefficients $c_{m-i,k}$ used for constructing $d_{m-i}$, $i = 1, \ldots, l$, are likely to be 0 and hence can contribute to a very parsimonious representation of the function f. Indeed, a wavelet decomposition is a natural suggestion for a technology for high definition television (HDTV). If $f_{m-l}$ represents the lower resolution conventional NTSC TV signal, then to reconstruct a high resolution image all that is needed is the difference signal, which could be parsimoniously represented by the wavelet coefficients $c_{m-i,k}$, $i = 1, \ldots, l$ and $k \in J$, most of which would be 0.
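
The recursion $f_m = f_{m-1} + d_{m-1}$ and the resulting parsimony can be illustrated with the Haar case on a short sequence. The sketch below is an assumption-laden toy: it uses unnormalized pairwise averages and half-differences in place of the orthonormal coefficients $c_{j,k}$, and the function names and data are hypothetical.

```python
def haar_step(values):
    """One level: pairwise averages (coarser approximation) and half-differences (detail)."""
    half = len(values) // 2
    coarse = [(values[2 * i] + values[2 * i + 1]) / 2.0 for i in range(half)]
    detail = [(values[2 * i] - values[2 * i + 1]) / 2.0 for i in range(half)]
    return coarse, detail

def haar_decompose(values, levels):
    """Apply haar_step repeatedly; details[0] is the finest level."""
    coarse, details = list(values), []
    for _ in range(levels):
        coarse, detail = haar_step(coarse)
        details.append(detail)
    return coarse, details

def haar_reconstruct(coarse, details):
    """Invert haar_decompose by adding the details back, coarsest level first."""
    values = list(coarse)
    for detail in reversed(details):
        finer = []
        for c, w in zip(values, detail):
            finer.extend([c + w, c - w])
        values = finer
    return values

signal = [4.0, 4.0, 4.0, 4.0, 1.0, 1.0, 1.0, 1.0]  # piecewise constant, one jump
coarse, details = haar_decompose(signal, 3)
nonzero_details = sum(1 for level in details for w in level if w != 0.0)
```

For this piecewise constant signal only one detail coefficient is non-zero, which is the kind of sparse representation the paragraph above describes.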

Most important, however, is the observation that the scaling function $\phi \in V_0$ and the mother wavelet $\psi \in W_0$ are both in $V_1$. Since $V_1$ is generated by $\phi_{1,k}(x) = 2^{1/2}\phi(2x - k)$, there are sequences $\{g(k)\}$ and $\{h(k)\}$ such that

(4.3)  $\phi(x) = \sum_{k \in J} g(k)\, \phi(2x - k)$ and $\psi(x) = \sum_{k \in J} h(k)\, \phi(2x - k)$.

This remarkable result gives us a construction for the mother wavelet in terms of the scaling function. These equations are called the two-scale difference equations. We can give a time series interpretation to these equations. Let us consider an original discrete time function, f(n), to which we apply the filter

$y(n) = \sum_{k \in J} g(k)\, f(2n - k).$

First of all we note that there is a scale change due to subsampling by two, i.e. a shift by two in f(n) results in a shift of one in y(n). The scale of y is only half that of f. Otherwise this is a low-pass filter with impulse response function g. Let us consider iterating this equation so that

(4.4)  $y^{(j)}(n) = \sum_{k \in J} g(k)\, y^{(j-1)}(2n - k).$

Notice that if this procedure converges, it converges to a fixed point which will be $\phi$. This iterative procedure with repeated downsampling by two is suggestive of a method for constructing wavelets. If g is a finite impulse response (FIR) filter of length $L$, the construction of a complementary high-pass filter is accomplished with a FIR filter, h, whose impulse response is given by $h(L - 1 - n) = (-1)^n g(n)$. This scheme is called sub-band coding in the electrical engineering literature. The low-pass band is given by

(4.5)  $y_0(n) = \sum_{k \in J} g(k)\, f(2n - k)$

while the high-pass band is given by

(4.6)  $y_1(n) = \sum_{k \in J} h(k)\, f(2n - k).$

The filter impulse responses as defined form an orthonormal set so that f may be reconstructed by

(4.7)  $f(n) = \sum_{k \in J} [\, y_0(k)\, g(2k - n) + y_1(k)\, h(2k - n)\, ].$

The sub-band coding scheme may be repeatedly applied to form the nested sequence of $V_j$. The nested sequence of $\{V_j\}$ is then essentially obtained by recursively downsampling and filtering a function with a low-pass filter whose impulse response function is $g(\cdot)$.
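
One level of sub-band coding, following (4.5), (4.6) and (4.7), can be sketched as below. Several choices here are assumptions rather than part of the text: the four-tap Daubechies filter is used as the example, its taps are rescaled by $\sqrt{2}$ so that $\sum_n g(n)^2 = 1$, and finite signals are extended periodically. All names are hypothetical.

```python
import math

SQRT3 = math.sqrt(3.0)
# Four-tap low-pass filter in the normalization where the taps sum to 1.
G_SUM_ONE = [(1 + SQRT3) / 8, (3 + SQRT3) / 8, (3 - SQRT3) / 8, (1 - SQRT3) / 8]
# Assumed rescaling by sqrt(2) so that the sum of squared taps equals 1.
G = [math.sqrt(2.0) * c for c in G_SUM_ONE]
L = len(G)
# Complementary high-pass filter: h(L - 1 - n) = (-1)^n g(n).
H = [0.0] * L
for n in range(L):
    H[L - 1 - n] = (-1) ** n * G[n]

def analyze(f):
    """Low-pass and high-pass bands y0, y1 of (4.5) and (4.6), periodic extension."""
    size = len(f)  # assumed even
    y0 = [sum(G[k] * f[(2 * n - k) % size] for k in range(L)) for n in range(size // 2)]
    y1 = [sum(H[k] * f[(2 * n - k) % size] for k in range(L)) for n in range(size // 2)]
    return y0, y1

def synthesize(y0, y1):
    """Reconstruction (4.7): each band sample adds g or h taps at index 2n - k."""
    size = 2 * len(y0)
    f = [0.0] * size
    for n in range(len(y0)):
        for k in range(L):
            f[(2 * n - k) % size] += y0[n] * G[k] + y1[n] * H[k]
    return f

signal = [1.0, 4.0, -2.0, 0.5, 3.0, 3.0, -1.0, 2.0]  # made-up data
low, high = analyze(signal)
rebuilt = synthesize(low, high)
```

Applying `analyze` again to the low-pass band gives the next coarser level, which is the recursive downsampling described above.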

4.4  Construction  of  Scaling  Functions  and  Mother  Wavelets. 

We have already hinted that the scaling function may be constructed as the fixed point of the down-sampled, low-pass filter equation (4.4). This can be formalized by considering what statisticians would call the generating function of g(n) and what electrical engineers call the z-transform of $g(\cdot)$,

(4.8)  $G(z) = \frac{1}{2} \sum_{j \in J} g(j)\, z^j.$

Notice that if $z = e^{-i\omega/2}$, then (4.8) is essentially the Fourier transform of the impulse response function $g(\cdot)$. In this case, the first equation in (4.3) may be written as

(4.9)  $\hat\phi(\omega) = G(z)\, \hat\phi(\omega/2)$, with $z = e^{-i\omega/2}$.

This, of course, follows because the Fourier transform of a convolution is the corresponding product of the Fourier transforms. This recursive equation may be iterated to obtain

(4.10)  $\hat\phi(\omega) = \prod_{k=1}^{\infty} G(e^{-i\omega/2^k})\, \hat\phi(0).$

We may take $\hat\phi$ to be continuous with $\hat\phi(0) = 1$. Based on (4.10) we may recover $\phi(\cdot)$ and, based on this result, the equation $h(L - 1 - n) = (-1)^n g(n)$ and the second equation of (4.3), we may recover the mother wavelet, $\psi(\cdot)$. Thus Daubechies' original construction shows that wavelets with compact support can be based on finite impulse response filters, a construction originally motivated by multiresolution analysis. Theorem 4.3 below summarizes the general form of Daubechies' result.

Theorem 4.3: (Daubechies' Wavelet Construction):

Let g(n) be a sequence such that

a.  $\sum_{n \in J} |g(n)|\, |n|^{\epsilon} < \infty$ for some $\epsilon > 0$,

b.  $\sum_{n \in J} g(n - 2j)\, g(n - 2k) = \delta_{jk}$,

c.  $\sum_{n \in J} g(n) = 1$.

Suppose that $\hat g(\omega) = G(e^{-i\omega/2}) = 2^{-1} \sum_{n \in J} g(n)\, e^{-in\omega/2}$ can be written as

$\hat g(\omega) = \left[ \tfrac{1}{2}(1 + e^{-i\omega/2}) \right]^N \left[ \sum_{n \in J} f(n)\, e^{-in\omega/2} \right]$

where

d.  $\sum_{n \in J} |f(n)|\, |n|^{\epsilon} < \infty$ for some $\epsilon > 0$,

e.  $\sup_{\omega \in R} \left| \sum_{n \in J} f(n)\, e^{-in\omega/2} \right| < 2^{N-1}$.

Define

$h(n) = (-1)^n g(-n + 1),$

$\psi(x) = \sum_{k \in J} h(k)\, \phi(2x - k).$

Then the orthonormal wavelet basis is $\psi_{j,k}$ determined by the mother wavelet $\psi$. Moreover, if g(n) = 0 for $|n| > n_0$, then the wavelets so determined have compact support. □

We state this result without proof, which may be found in Daubechies (1988). We note that Daubechies also shows that the mother wavelet, $\psi$, cannot be an even function and also have compact support. The exception to this is the trivial constant function which gives rise to the so-called Haar basis. Daubechies illustrates this computation with the example of g given by $g(0) = (1 + \sqrt{3})/8$, $g(1) = (3 + \sqrt{3})/8$, $g(2) = (3 - \sqrt{3})/8$ and, finally, $g(3) = (1 - \sqrt{3})/8$. This wavelet is illustrated in Figure 4.1.


Figure 4.1b. Daubechies' Mother Wavelet using 4-t...

5.  Transient Signal Function Estimation.

Now with the basic construction of wavelets in hand, we can turn to the transient signal processing application, one of the prime applications of wavelets. In particular, since the most effective wavelets are those with compact support, they are a natural basis for transient signal estimation. However, if we are to exploit them in the context of optimal nonparametric function estimation, we must construct an optimality criterion for transient signals. The discussion below outlines an approach to transient signal estimation set in the context of optimal nonparametric function estimation. A fuller treatment can be found in Le and Wegman (1992). We first consider signals. It is well-known that there is no non-zero function in L2(R) which is both band-limited and time-limited. This being the case, we will assume the signal to be hard band-limited, i.e. with no energy outside a fixed interval, say $[-\nu, \nu]$, but soft time-limited, i.e. with minimal energy in the tails. This particular example demonstrates an elegant application of moments to signal processing.

5.1  Measuring  of  Out-of-Band  Energy 

Let L2(R) be the set of square-integrable, real-valued functions and let $f(t) \in L_2(R)$. Denote by $\hat f(\omega)$ the Fourier transform of f(t), with $\hat f \in L_2(R)$. We assume f is frequency band-limited so that $\hat f(\omega) = 0$ for $|\omega| > \nu$. We propose approximating the class of band-limited time-transient functions by considering functions whose energy time spread is confined to some small level $s_0$. As a measure of the energy time-spread, we will use analogies to concepts from probability theory to define various moments of $|f(t)|^2$, which plays the role of the energy distribution function. Assuming that

$\int_{-\infty}^{\infty} |t|^j\, |f(t)|^2\, dt < \infty, \quad j = 1, 2, \ldots, k,$

the kth moment of the energy distribution will now be defined as follows

$M_k = \int_{-\infty}^{\infty} t^k\, |f(t)|^2\, dt.$

For k = 2, we have the 2nd moment of the energy distribution function as a measure of the energy time spread, given as

$M_2 = \int_{-\infty}^{\infty} t^2\, |f(t)|^2\, dt.$

Remark: The factor $t^k$ serves as a weight on the energy function which is used to control the degree of spreading in $|f(t)|^2$. A larger k value implies that more weight is applied at the tail-end of the energy distribution function and, therefore, the process of minimizing requires that more energy be centrally concentrated.
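
To make the role of $M_2$ concrete, the sketch below approximates $M_k$ by the trapezoid rule for two Gaussian pulses of different widths. The pulses, the truncation of the integral to a finite interval, and the function names are assumptions made only for this illustration.

```python
import math

def energy_moment(f, k, t_max=10.0, num_steps=4000):
    """Trapezoid-rule approximation of the integral of t^k |f(t)|^2 over [-t_max, t_max]."""
    h = 2.0 * t_max / num_steps
    total = 0.0
    for i in range(num_steps + 1):
        t = -t_max + i * h
        weight = 0.5 if i in (0, num_steps) else 1.0
        total += weight * (t ** k) * abs(f(t)) ** 2
    return total * h

def gaussian_pulse(t, width=1.0):
    """Hypothetical transient signal exp(-t^2 / (2 width^2))."""
    return math.exp(-0.5 * (t / width) ** 2)

m2_narrow = energy_moment(gaussian_pulse, 2)
m2_wide = energy_moment(lambda t: gaussian_pulse(t, width=2.0), 2)
```

The wider pulse has the larger second moment ($4\sqrt{\pi}$ against $\sqrt{\pi}/2$), so bounding $M_2$ favors signals whose energy is concentrated near the origin.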

5.2  Optimal Estimation of Band-Limited Processes

For $-\nu$ and $\nu$ real numbers, and m and p integers, where $-\infty < -\nu < \nu < \infty$, $m > 0$ and $p > 1$, the Sobolev space $W^{m,p}[-\nu, \nu]$ of complex-valued functions $\hat f$ on $[-\nu, \nu]$ is given by:

$W^{m,p}[-\nu, \nu] = \{ \hat f(\omega): \hat f^{(k)}(\omega),\ k = 0, 1, \ldots, m-1,$ are absolutely continuous and $\int_{-\nu}^{\nu} |\hat f^{(m)}(\omega)|^p\, d\omega < \infty \}.$

We consider observing an actual process, r(t), and we let $\hat r(\omega)$ be its Fourier transform. The Fourier transform of the observed process will then be modeled as $\hat r(\omega) = \hat g(\omega) + \xi(\omega)$, where $\xi(\omega)$ is the spectrum of a stationary noise process and $\hat g(\omega) \in W^{m,2}[-\nu, \nu]$. The fact that $\hat f$ belongs to the class $W^{m,2}[-\nu, \nu]$ of band-limited signals implies that the support of $|f(t)|^2$ is not bounded. The objective is, then, to find a function $\hat f(\omega) \in W^{m,2}[-\nu, \nu]$ which best fits the Fourier transform $\hat r(\omega)$ of the observed process r(t) with minimum time-energy spread; specifically we would like to minimize the following functional with k < m

(5.1)  $\min_{\hat f \in W^{m,2}[-\nu,\nu]} \left[ \sum_{j=1}^{n} (\hat f(\omega_j) - \hat r(\omega_j))^2 \right]$ subject to $\int_{-\infty}^{\infty} t^{2k}\, |f(t)|^2\, dt \le s_0$,

where f(t) is the inverse Fourier transform corresponding to $\hat f(\omega)$ in $W^{m,2}[-\nu, \nu]$.

5.3  Moment Connection via Parseval's Theorem

A rather elegant extension of Parseval's Theorem can be constructed under appropriate regularity conditions. Parseval's Theorem for continuous Fourier transform pairs is

$\int_{-\nu}^{\nu} |\hat f(\omega)|^2\, d\omega = \frac{1}{2\pi} \int_{-\infty}^{\infty} |f(t)|^2\, dt.$

But we know

$\hat f(\omega) = \frac{1}{2\pi} \int_{-\infty}^{\infty} f(t)\, e^{-i\omega t}\, dt.$

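Take kth derivatives with respect to $\omega$

$\frac{d^k \hat f(\omega)}{d\omega^k} = \frac{1}{2\pi} \int_{-\infty}^{\infty} (-it)^k\, f(t)\, e^{-i\omega t}\, dt$

so that

$\hat f^{(k)}(\omega) = \frac{d^k \hat f(\omega)}{d\omega^k}$ is the Fourier transform of $(-it)^k f(t)$.

We can apply Parseval's Theorem to this Fourier transform pair to obtain

Theorem 5.1:

$\int_{-\nu}^{\nu} |\hat f^{(k)}(\omega)|^2\, d\omega = \frac{1}{2\pi} \int_{-\infty}^{\infty} t^{2k}\, |f(t)|^2\, dt.$ □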
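
As a check on the constants, the identity can be worked by hand for k = 1 with a Gaussian pulse. This example rests on two assumptions: the forward transform carries the factor $1/(2\pi)$, and, since a Gaussian is not band-limited, the frequency integral is taken over the whole real line (the limiting case of a very wide band).

```latex
\begin{align*}
f(t) &= e^{-t^2/2}, \qquad
\hat f(\omega) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-t^2/2}\, e^{-i\omega t}\,dt
               = \frac{1}{\sqrt{2\pi}}\, e^{-\omega^2/2}, \\
\int_{-\infty}^{\infty} \bigl|\hat f^{(1)}(\omega)\bigr|^2\,d\omega
  &= \frac{1}{2\pi}\int_{-\infty}^{\infty} \omega^2 e^{-\omega^2}\,d\omega
   = \frac{1}{2\pi}\cdot\frac{\sqrt{\pi}}{2}, \\
\frac{1}{2\pi}\int_{-\infty}^{\infty} t^{2}\,|f(t)|^2\,dt
  &= \frac{1}{2\pi}\int_{-\infty}^{\infty} t^2 e^{-t^2}\,dt
   = \frac{1}{2\pi}\cdot\frac{\sqrt{\pi}}{2}.
\end{align*}
```

Both sides agree, which is the k = 1 case of the moment identity.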

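Thus, our optimization problem (5.1) can now be reformulated as

(5.2)  $\min_{\hat f \in W^{m,2}[-\nu,\nu]} \left\{ \sum_{j=1}^{n} (\hat f(\omega_j) - \hat r(\omega_j))^2 \right\}$ subject to $\int_{-\nu}^{\nu} |\hat f^{(k)}(\omega)|^2\, d\omega \le s_0$,

where the constant factor $2\pi$ from Theorem 5.1 has been absorbed into $s_0$. Using standard Lagrange multiplier techniques, this in turn may be reformulated as

(5.3)  $\min_{\hat f \in W^{m,2}[-\nu,\nu]} \left( \sum_{j=1}^{n} (\hat f(\omega_j) - \hat r(\omega_j))^2 + \lambda \int_{-\nu}^{\nu} |\hat f^{(k)}(\omega)|^2\, d\omega \right).$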

Indeed, expression (5.3) is the form of optimization problem whose solution is a generalized polynomial spline of degree 2k-1. This result may be substantially generalized by the theorem given below, which is developed in Le and Wegman (1992).
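
The trade-off in (5.3) between fidelity and roughness can be seen in a discrete analogue. The sketch below is not the L-spline algorithm: it assumes real-valued observations at equally spaced frequencies, replaces the k = 2 derivative penalty by squared second differences, and solves the resulting linear system $(I + \lambda D^T D) f = r$ directly. All names and data are hypothetical.

```python
def second_difference_penalty(n):
    """Return the n-by-n matrix D^T D, where D takes second differences."""
    p = [[0.0] * n for _ in range(n)]
    for i in range(n - 2):
        row = {i: 1.0, i + 1: -2.0, i + 2: 1.0}
        for a, va in row.items():
            for b, vb in row.items():
                p[a][b] += va * vb
    return p

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x

def penalized_fit(r, lam):
    """Minimize sum_j (f_j - r_j)^2 + lam * sum_j (f_{j+2} - 2 f_{j+1} + f_j)^2."""
    n = len(r)
    p = second_difference_penalty(n)
    a = [[(1.0 if i == j else 0.0) + lam * p[i][j] for j in range(n)] for i in range(n)]
    return solve(a, r)

def roughness(f):
    """Sum of squared second differences, the discrete stand-in for the penalty."""
    return sum((f[i + 2] - 2.0 * f[i + 1] + f[i]) ** 2 for i in range(len(f) - 2))

noisy = [0.0, 1.2, 0.1, 1.1, -0.2, 0.9, 0.0, 1.0]  # made-up observations
smooth = penalized_fit(noisy, 10.0)
```

With $\lambda = 0$ the fit interpolates the data, and as $\lambda$ grows the fit is pulled toward a straight line, mirroring the role of the smoothing parameter.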

Theorem 5.2: Let $\hat g(\omega)$ be a band-limited spectral process with transient inverse Fourier transform and $\hat r(\omega)$ be the observed spectral process defined over some finite band $-\nu \le \omega \le \nu$. We model this spectral process as

$\hat r(\omega) = \hat g(\omega) + \xi(\omega)$

where $\xi(\omega)$ is some stationary white noise process. Let $\Lambda$ be the time spread measure, defined as follows:

$\Lambda(f) = \alpha_0 \Lambda_j(f) + \alpha_1 \Lambda_k(f)$

where

$\Lambda_k(f) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} t^{2k}\, |f(t)|^2\, dt,$

and where $\alpha_0$ and $\alpha_1$ are appropriately chosen weights. Here f is the inverse Fourier transform of $\hat f$ belonging to L2(R). Then the optimal band-limited representation in the Sobolev space $W^{m,2}[-\nu, \nu]$ is $\hat f_\lambda(\omega)$, where $\hat f_\lambda(\omega)$ is the solution to the problem:

minimize over $\hat f \in W^{m,2}[-\nu, \nu]$:  $\sum_{j=1}^{n} (\hat f(\omega_j) - \hat r(\omega_j))^2 + \lambda \Lambda(f)$.

$\hat f_\lambda$ is a generalized L-spline, and $\lambda$ is known as the smoothing parameter. □

For a general discussion of L-splines, see Wegman and Wright (1983). Notice that if $\Lambda(f) = \Lambda_k(f)$ for some large k, then we are constructing a band-limited transient signal estimator with little energy in the tail of the signal estimate, $f_\lambda$, where $f_\lambda$ is the inverse Fourier transform of $\hat f_\lambda$. If k = 2, then

$\Lambda_2(f) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} t^4\, |f(t)|^2\, dt = \int_{-\nu}^{\nu} |\hat f^{(2)}(\omega)|^2\, d\omega$

and our solution is the well-known cubic spline. However, much more interesting and physically meaningful solutions may be found. If $\Lambda(f) = \alpha_0 \Lambda_0(f) + \alpha_1 \Lambda_k(f)$, then for k odd

$\Lambda(f) = \frac{\alpha_0}{2\pi} \int_{-\infty}^{+\infty} |f(t)|^2\, dt + \frac{\alpha_1}{2\pi} \int_{-\infty}^{+\infty} t^{2k}\, |f(t)|^2\, dt.$

Thus, we may also want to impose a total energy restriction on the estimated signal space. This imposed restriction may, for example, have resulted from a requirement to minimize channel bandwidth utilization in data transmission systems. Such a modification thus yields the following optimization problem for k odd

$\min_{\hat f \in W^{m,2}[-\nu,\nu]} \left[ \sum_{j=1}^{n} (\hat f(\omega_j) - \hat r(\omega_j))^2 + \lambda_1 \int_{-\nu}^{\nu} |\hat f(\omega)|^2\, d\omega + \lambda_2 \int_{-\nu}^{\nu} |\hat f^{(k)}(\omega)|^2\, d\omega \right].$

Hence, by our theorem the optimal solution is again an L-spline.

5.4  Computing  Band-limited  Transient  Estimators  and  Example 

The rather elegant result that our band-limited transient estimators are generalized L-splines makes the numerical computation of the estimators rather more routine, since algorithms already exist for computing L-splines. The fact that we can impose total energy limits as well as tail-energy limits is an unexpected bonus. Our interpretation of Theorem 5.2 is as follows. We recommend doing an initial spectral estimation to establish the bandwidth, $-\nu \le \omega \le \nu$, over which we want to estimate $\hat g(\omega)$ (or more precisely the signal, g(t), its inverse Fourier transform). This initial spectral estimate will also allow us to select the sampling frequencies, $\omega_j$. We recommend selecting these $\omega_j$ as the frequencies with the largest spectral mass. Notice that we may regard a transient signal, g(t), as the product of a signal of infinite support with an indicator function of a closed interval. It is well-known that the Fourier transform of an indicator function is the so-called Dirichlet kernel which has a large central lobe and decreasing side lobes. By choosing sampling frequencies at the location of the central and side lobes, our technique allows us to recover the indicator to an excellent approximation. Thus we estimate the transient signal not only because of the penalty term for out-of-band energy, but because of the choice of sampling frequencies as well. Figure 5.1 graphically illustrates the results of our technique.

Figure 5.1a. A two-cycle transient signal.

Figure 5.1b. Same two-cycle signal buried in Gaussian noise.

Figure 5.1c. Recovery of two-cycle signal waveform by optimal band-limited techniques.